PRACTICAL APPLICATIONS OF BIOLOGICAL REALISM IN ARTIFICIAL NEURAL NETWORKS

by

TEGAN MAHARAJ
A thesis submitted to the
Department of Computer Science in conformity with the requirements for the degree of Master of Science
Bishop’s University
Sherbrooke, Quebec, Canada
November 2015
Copyright © Tegan Maharaj, 2015

Abstract
Over the last few decades, developments in structure and function have made artificial neural networks (ANNs) state-of-the-art for many machine learning applications, such as self-driving cars, image and facial recognition, and speech recognition. Some developments (such as error backpropagation) have no obvious biological motivation, while others (such as the topology of convolutional neural networks) are directly modeled on biological systems. We review biological neural networks with a phenomenological approach to establish a framework for intuitively comparing artificial and biological neural networks based on ‘information’ and ‘actions’ which can be either positive or negative.
Based on this framework, we suggest interpreting the activation of a neuron as an amount of neurotransmitter and weights as a number of receptors, and draw a connection between Spike-Timing-Dependent Plasticity (STDP) and backpropagation. We apply our framework to explain the biological plausibility of several empirically successful innovations in ANNs, and to propose a novel activation function which employs negative rectification. We demonstrate that a network with such negative rectifier units (NegLUs) can represent temporally-inverted STDP (ti-STDP). NegLUs are tested in a convolutional architecture on the MNIST and CIFAR benchmark datasets, where they are found to help prevent overfitting, consistent with the hypothesized role of ti-STDP in deep neural networks. We suggest applications of these networks, and extensions of our model to form Min-Maxout networks and learn activation functions.

Acknowledgements
I would first like to thank my supervisor, L. Bentabet, for his good-natured humour, encouragement, and invaluable intuitive explanations. His pragmatic guidance fostered independence and critical thinking, and kept me grounded in a field of exciting and sometimes overwhelming complexity. I also owe many thanks to N. Khouzam and S. Bruda, for their excellent teaching, open doors, and years of practical and academic advice.
During my thesis, I had the wonderful opportunity of being offered a Mitacs research internship at iPerceptions Inc., which enabled me to explore data science and machine learning in the real world. I am very grateful to M. Butler,
L. Cochrane, and M. Tremblay for making this valuable experience rewarding and enjoyable. I would also like to thank J. Fredette, E. Pinsonnault, and S. Cote at the Research Office, for helping make this and many other great opportunities accessible.
I am forever grateful to S. Stoddard and my friends and co-workers at the ITS
Helpdesk, without whom I might never have discovered computer science, and equally to my friends and fellow students in J9 and J113, for making it feel like home.
Finally, I could never offer sufficient thanks to K. Gill, G. Guest, S. Matheson, and my parents, for the years of love and support that made everything possible.

Contents
Abstract
Acknowledgements
Contents
List of Tables
List of Figures and Illustrations
Chapter 1: Introduction
    1.1 Motivation
    1.2 Contributions of this work
    1.3 Publications and software
    1.4 Structure of the thesis
Chapter 2: Background
    2.1 Intelligence, learning, and memory
    2.2 Neural network basics
    2.3 Biological neural networks
        2.3.1 Biological neurons
        2.3.2 Biological neural networks
        2.3.3 Biochemical systems capable of learning
        2.3.4 Biological learning and memory
        2.3.5 Types of learning
    2.4 Artificial neural networks
        2.4.1 Artificial Intelligence
        2.4.2 Machine learning
        2.4.3 Artificial neurons
        2.4.4 Artificial neural networks
        2.4.5 Deep learning
        2.4.6 Deep belief networks
        2.4.7 Convolutional neural networks (convnets)
        2.4.8 Rectified linear units
        2.4.9 Dropout and DropConnect
        2.4.10 Maxout
Chapter 3: Comparing biological and artificial neural networks
    3.1 Introduction
        3.1.1 Traditional intuition
        3.1.2 Inhibition and excitation
        3.1.3 Accidental bias
    3.2 Proposed biological intuition for artificial neurons
    3.3 Spike-timing dependent plasticity terminology
    3.4 Related work
        3.4.1 Biological influence in artificial neural networks
        3.4.2 Other interpretations of the activation function
    3.5 Application of proposed biological intuition
        3.5.1 Hyperbolic tangent and 0-mean centred activation functions
        3.5.2 Boltzmann machines and deep belief networks (DBNs)
        3.5.3 Convolutional neural networks
        3.5.4 ReLUs
        3.5.5 Dropout, DropConnect, and Maxout
        3.5.6 Input normalization
        3.5.7 Unsupervised pre-training
Chapter 4: Negative rectifiers
    4.1 Introduction
    4.2 Formal definition
        4.2.1 Universal approximation
    4.3 Related work
        4.3.1 Dynamic field model of vision
        4.3.2 Modified ReLUs
        4.3.3 Other non-monotonic activation functions
    4.4 Materials and methods
        4.4.1 The MNIST dataset
        4.4.2 The CIFAR-10 dataset
        4.4.3 Network architectures for MNIST
        4.4.4 Network architectures for CIFAR-10
    4.5 Results and discussion
        4.5.1 MNIST
        4.5.2 CIFAR-10
    4.6 Discussion
Chapter 5: Conclusions
    5.1 Summary discussion
    5.2 Future directions
References

List of Tables
Table 1: Summary of possible causes and effects in spike-timing dependent plasticity (STDP)
Table 2: Conventional view of interpretations of neural network properties
Table 3: Suggested biological interpretations of neural network properties
Table 4: Summary of possible effects at a synapse when weights and activations can be negative
Table 5: Proposed terminology for causes and effects in spike-timing dependent plasticity (STDP)
Table 6: Performance (error) of MixLU-LeNet networks with increasing numbers of NegLUs
Table 7: Test and training error results for various MixLU-LeNets, showing that a network with 4% NegLUs achieves the largest increase in training error for the smallest increase in test error
Table 8: Error of the ReLU-LeNet and MixLU-LeNet with 1% NegLUs for the first 9 training epochs, highlighting that the MixLU net finds low-error regions faster, but then leaves them
Table 9: Performance (accuracy) of MixLU-LeNet5 networks with increasing numbers of NegLUs
Table 10: Test, training, and difference of test and training accuracies for various MixLU-LeNet5s, showing that the network with 0.5% NegLUs was able to achieve a reduction in training accuracy with no drop in test accuracy

List of Figures and Illustrations
Figure 1: A cartoon biological neuron. Modified from (Health 2015)
Figure 2: The categorization of neurons based on axon/dendrite configuration is shown at left, modified from (HowStuffWorks 2001), while a more realistic continuum of characteristics, and therefore connectivity, is shown at right, modified from (Stuffelbeam 2008)
Figure 3: Chemical transmission of an action potential. From (Cummings 2006)
Figure 4: Hypothesized evolution of the modern synapse, from (Ryan and Grant 2009)
Figure 5: A ‘network’ of two input perceptrons, x1 and x2, and one output, y, can perform a linear separation between 0s and 1s (an AND gate) if it has learned appropriate weights
Figure 6: An artificial neural network with 1 hidden layer set up to classify the digits 0-9 in a 784-pixel image
Figure 7: From (Goodfellow, et al. 2013), demonstrating how a maxout unit can act as a rectifier or absolute value function, or approximate a quadratic function. The authors note this is only a 2-D case for illustration; the concept extends equally well to arbitrary convex functions
Figure 8: The logistic sigmoid, hyperbolic tangent, rectified linear, and negative-linear activation functions
Figure 9: Derivative of the negative-linear rectifier
Figure 10: A random sample of MNIST training examples
Figure 11: CIFAR-10 classes with a random sample of images for each
Figure 12: From left to right, top to bottom; performance on MNIST of MixLU-LeNet networks with 1, 2, 3, 4, 5, and 6% NegLUs in the fully-connected layer. Note that training error increases while validation error does not, until 6%
Figure 13: From left to right, top to bottom; performance on CIFAR-10 of MixLU-LeNet5 networks with 0, 0.25, 0.5, 1, 2, and 3% NegLUs in the fully-connected layers. Note that training error approaches test error quite closely without a large decrease in error, up to 2% NegLUs


Chapter 1:
Introduction
1.1 Motivation
What is intelligence, and how does it work? The answer to this question is long, multifaceted, and as yet incomplete, built over the years from research in philosophy, biology, chemistry, psychology, mathematics, statistics, computer science, information theory, and many other disciplines.
Insects and other invertebrates perform complex motor movements, search for and remember paths to food resources, communicate socially, and build elaborate structures. Single-celled organisms can navigate, find food, learn, and remember. The average human can read many textbooks’ worth of information in a lifetime, create beautiful things, maintain complex social relationships, and (sometimes) even do their own taxes.
Our understanding of the physical processes which give rise to these behaviours is growing every day, and has already allowed us to imitate many intelligent behaviors in useful technologies. From self-driving cars and drones to disease therapies and ecological conservation, it is an understatement to say there are many applications for artificial intelligence. Understanding robust problem- solving, selective filtering, long-term memory, self-directed learning, and creative pattern recognition in a practical, concrete way may seem an insurmountable task, but we have many of the building blocks in place.


This work reviews our current scientific understanding of intelligence, learning, and memory in biological systems as an aeronautical engineer looks at birds, bats, and flying insects – not with the intention of duplicating them exactly, but of taking functional inspiration and intuition to build better, more useful artificially intelligent systems.
1.2 Contributions of this work
Most of the empirical work on which the classic artificial neuron abstraction is based examines excitatory neurons. We summarize research on the role of inhibition, which we demonstrate is particularly relevant for deep, hierarchical networks.
It is generally stated that the activation of a neuron in an artificial neural network represents its firing rate. Although this idea is not without merit, we suggest that it is more accurate to think of the activation as a quantity, specifically of neurotransmitter, which as an input value is scaled by the number of receptors available for binding that neurotransmitter (i.e., the effective synaptic strength), and whose effect can be modified by the sign (inhibitory/excitatory). We demonstrate that thinking of the activation in this way allows for more biologically accurate modeling of inhibitory and excitatory behaviours, which increases the representational capacity of artificial neurons.
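This interpretation can be sketched in a few lines of code. The following is our own toy illustration (variable names and numbers are not from the thesis): each presynaptic activation is an amount of neurotransmitter, each weight a signed receptor count, and a negative weight models an inhibitory synapse.

```python
# Toy model of the proposed interpretation: activation = amount of
# neurotransmitter released, weight = signed receptor count, where the
# sign marks the synapse as excitatory (+) or inhibitory (-).
def postsynaptic_input(transmitter_amounts, receptor_counts):
    """Total drive on a neuron: sum of transmitter x signed receptors."""
    return sum(a * w for a, w in zip(transmitter_amounts, receptor_counts))

# Two excitatory synapses and one inhibitory synapse:
amounts = [0.8, 0.5, 0.9]    # neurotransmitter released at each synapse
weights = [2.0, 1.0, -1.5]   # receptor counts; negative = inhibitory
drive = postsynaptic_input(amounts, weights)   # 1.6 + 0.5 - 1.35 = 0.75
```

Under this reading, learning corresponds to changing receptor counts (weights), while the instantaneous signal corresponds to transmitter quantities (activations).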


We apply this line of thinking to introduce negative-rectified linear units (which we call NegLUs, in the fashion of ReLU for rectified linear unit), whose behaviour we use to investigate the utility of inhibitory, non-monotonic activation functions in deep networks. We show that they have the potential to reduce overfitting in convolutional architectures, and show how they allow the network to perform temporally-inverted anti-Hebbian learning. We suggest that this method, or generalizations of it that we call Min-Maxout networks and
ConvAct networks could be useful for exploring deep architectures.
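As a rough sketch of the idea (the formal definition is deferred to Chapter 4; here we assume NegLU(x) = min(0, x), a rectifier that passes only negative input, and the layer mixing a small fraction of NegLUs into ReLUs is our own illustrative reconstruction):

```python
def relu(x):
    """Standard rectified linear unit: passes only positive input."""
    return max(0.0, x)

def neglu(x):
    """Assumed form of the negative rectifier: passes only negative input."""
    return min(0.0, x)

def mixlu_layer(preactivations, neg_fraction=0.04):
    """Apply NegLUs to a small fraction of units and ReLUs to the rest.

    The 4% default echoes the fraction reported in the results tables,
    but the assignment of units here is purely illustrative.
    """
    n_neg = int(len(preactivations) * neg_fraction)
    return ([neglu(x) for x in preactivations[:n_neg]] +
            [relu(x) for x in preactivations[n_neg:]])
```

A unit with this activation can only ever push its downstream targets in the negative (inhibitory) direction, which is the behaviour the later chapters connect to anti-Hebbian learning.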
1.3 Publications and software
A preliminary version of a neural network with activation-function modifiers imitating the behaviour dynamics of four neurotransmitters was tested on MNIST, and these results were presented in a poster:
Tegan Maharaj. Introducing “neurotransmitters” to an artificial neural network for modular concept learning and more accurate classification. Bishop’s University Research Week, 2014.
A new multi-type layer was created for the MATLAB package MatConvNet (Vedaldi and Lenc 2015), along with the files necessary for implementing negative-linear activations, and made publicly available at https://github.com/teganmaharaj/neglu


Two papers based on this work have been prepared for submission to peer-reviewed venues:
Tegan Maharaj. Activation as a quantity of neurotransmitter: Lessons learned from comparing biological and artificial neural networks. International Conference on Learning Representations (ICLR), 2016.
Tegan Maharaj. Negative rectifiers reduce overfitting by modeling temporally-inverted anti-Hebbian learning. Advances in Neural Information Processing Systems (NIPS), 2016.
1.4 Structure of the thesis
The background section first considers the nature of intelligence, and formal work which links intelligence to statistical physics and information theory. We present this section first, along with some basic terminology for both biological and artificial neural networks, in order to provide a context for thinking about the following sections.
We cover biological neurons and neural networks, with a focus on their evolution, leading us to consider the biochemical reactions which permit intelligent behaviours. We discuss intelligence, learning, and memory in biological systems, and the processes hypothesized to underlie these phenomena: long-term potentiation and long-term depression, via various forms of spike-timing-dependent plasticity (STDP).
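Pair-based STDP is commonly modeled with exponential windows; the following is the standard textbook rule (constants chosen for illustration, not this thesis's own formulation):

```python
import math

def stdp_dw(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Weight change for a spike pair with dt = t_post - t_pre (ms).

    Pre-before-post (dt > 0) gives long-term potentiation (LTP);
    post-before-pre (dt < 0) gives long-term depression (LTD).
    Both effects decay exponentially as |dt| grows.
    """
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_ms)    # LTP
    if dt_ms < 0:
        return -a_minus * math.exp(dt_ms / tau_ms)   # LTD
    return 0.0
```

The asymmetry of this window, and what happens when it is temporally inverted, is the biological thread picked up again in Chapters 3 and 4.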


We then look at artificial neural networks, in the context of machine learning and artificial intelligence, and examine recent, empirically successful innovations in deep networks such as rectified-linear activations, Maxout networks, Network-in-Network architectures, and DropConnect.
The next section contrasts biological and artificial neural networks, and suggests some modifications to the common biology-based intuition for describing artificial neural networks. Based on the framework developed by this comparison, we examine the biological relevance of the artificial neural network innovations discussed earlier. We describe how backpropagation in artificial neural networks relates to learning in biological neural networks, and based on spike-timing-dependent plasticity research in biological networks, suggest that non-monotonic and monotonically-decreasing activation functions may be relevant for learning in deep artificial neural networks.
The last section tests this hypothesis by implementing a novel activation function employing negative-linear rectification in a deep convolutional network.
This network is tested on two common image recognition benchmark datasets,
MNIST and CIFAR. Results are presented, discussed, and compared with other relevant work. A generalization of NegLUs to Min-Maxout networks, and to learning activation functions in ConvAct networks, is suggested. We conclude with implications for useful deep neural networks and learning theory, with avenues for future research.


Chapter 2:
Background
2.1 Intelligence, learning, and memory
The concept of intelligence is an intuitive one for most people, but has proved difficult to rigorously define without a full understanding of its mechanisms. The Oxford dictionary defines intelligence as “the ability to acquire and apply knowledge and skills” (2015), and the Merriam-Webster as “the ability to learn or understand or to deal with new or trying situations: reason; also: the skilled use of reason: the ability to apply knowledge to manipulate one's environment or to think abstractly as measured by objective criteria (as tests) … the act of understanding: comprehension” (Merriam-Webster 2015). Many psychological and artificial intelligence definitions include aspects of planning and goal-directed behaviour, and also emphasize the definition of intelligence as dependent on its measurement, in the form of tests of some kind. Legg and Hutter review a variety of definitions in (Legg and Hutter 2006), and conclude by adopting the following definition: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments”. The idea of “intelligence as compression”, or the ability to encode information in a cost-efficient way, without (or with minimal) loss of information, has also been suggested as a definition of artificial intelligence (Legg and Hutter 2006) (Dowe, Hernandez-Orallo and Das 2011). Relating these two definitions would suggest that the ability of a system


to achieve goals in a wide range of environments can be measured by the extent to which it is able to efficiently compress information.
As noted by Legg and Hutter, these definitions leave learning, memory, consciousness, and adaptation as implicit components (or mechanisms) of intelligence. This doesn’t seem to help us much in actually making useful intelligent systems, but the related notion of intelligence as involving agents acting to minimize energy (efficient compression being one way to minimize the resources spent encoding information) is useful because it gives us a natural connection to the powerful language of statistical physics and thermodynamics. This language describes overall or emergent phenomena of a system without needing to know too much about the states of individual particles in the system, simply based on the concept that entropy increases over time. A full review of these connections is outside the scope of this work, but it can be instructive to keep in mind the idea of intelligence as a way to minimize energy expended while maximizing internal order. This has a nice connection to the evolution of biological networks, and also to the optimization-based concept of minimizing a cost function, and sets the stage for viewing biological and artificial neural networks as different implementations of the same behaviours. For an introduction to the concepts of statistical physics in the context of artificial neural networks, see (Engel and Broeck 2001) (Hopfield 1982), and for some possible applications to biological systems, see (England 2013).
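As a toy illustration of the “intelligence as compression” idea (our own example, not from the cited works), a general-purpose compressor squeezes regular data into far fewer bytes than random data, so compressed size can serve as a crude proxy for how much structure has been captured:

```python
import random
import zlib

random.seed(0)
structured = b"abcd" * 1000                                # regular pattern
noisy = bytes(random.randrange(256) for _ in range(4000))  # pseudo-random

ratio_structured = len(zlib.compress(structured)) / len(structured)
ratio_noisy = len(zlib.compress(noisy)) / len(noisy)
# Regular data compresses to a tiny fraction of its original size,
# while random-looking data barely compresses at all.
```

A system that discovers the repeating pattern can describe the first sequence very cheaply; no such cheap description exists for the second, which is the sense in which compression measures captured regularity.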
Before proceeding with a discussion of learning and memory, we will first look at the structures which appear to make it possible – neural networks.


2.2 Neural network basics
Over the years many criticisms have been leveled at researchers in artificial intelligence for using biological terms loosely or incorrectly. Artificial intelligence research has also weathered several “AI Winters”¹ with subsequent revivals, each complete with new terminology to shed negative perceptions of the past. Somewhat independently of AI research in general, artificial neural networks more specifically have fallen in and out of favour and practical importance.
Research on biological intelligence comes from a mix of psychology, computational neuroscience, sociology, evolutionary biology, and other fields, and often considers many factors other than neurons. These factors include the influence of other cell types like glia; time; temperature; chemical gradients and other physiological phenomena; the macroscopic organization of the brain into regions and hemispheres; genetics; and much more.
Thus the term ‘neural network’ is outdated and somewhat uncomfortable for both disciplines. However, its use is entrenched, particularly in computer science, and it does lend itself to intuitive explanations and connections with network theory. With this in mind, this work defines a neural network as a network whose interconnected nodes (called neurons or units) each perform some small computation on an input or inputs, which allows the network as a whole to perform some emergent function (such as classification, or memory).

¹ “AI Winters” are periods of time when artificial intelligence research was unpopular after falling short of the often-fantastical predictions of its capabilities. For a brief history of artificial intelligence research, see (Buchanan 2006).
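This definition can be made concrete with the smallest possible example: a single threshold unit whose “small computation” yields the logical AND of two inputs (hand-picked weights standing in for learned ones; cf. the perceptron example of Figure 5):

```python
def and_gate(x1, x2, w1=1.0, w2=1.0, bias=-1.5):
    """One node's 'small computation': a weighted sum and a threshold.

    With these weights, the emergent function of this one-unit 'network'
    is the logical AND of its two binary inputs.
    """
    return 1 if (w1 * x1 + w2 * x2 + bias) > 0 else 0

# Truth table: only the input (1, 1) exceeds the threshold.
outputs = [and_gate(a, b) for a in (0, 1) for b in (0, 1)]  # [0, 0, 0, 1]
```

Stacking many such units, and learning rather than hand-picking the weights, gives the classification and memory functions discussed in the rest of this chapter.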
For a more detailed treatment of complex networks, see (Albert and Barabasi 2002). This work distinguishes biological neural networks (networks based on biological neurons, which have evolved in living creatures) from artificial neural networks (a machine learning tool built on an abstracted neuron). We favour the somewhat archaic use of “artificial neural network” over the more common “neural network” in order to make this distinction, and maintain “neural network” as an origin- and application-agnostic term which encompasses both.
2.3 Biological neural networks
2.3.1 Biological neurons
The nervous system is composed primarily of two cell types: neurons, which are specialized for establishing and passing on electrochemical signals, and glial cells. Glial cells have classically been considered a passive substrate for the more active work of neurons. This view is changing, with recent work recognizing the importance of glial cells in modulating neuronal behaviour; see (Fields and Stevens-Graham 2002) (Magistretti 2006) (Verkhratsky 2006). Glial cells and neurons differentiate from a common cell type early in development but share a lot of morphological characteristics. Glia are specialized for modulating the neuronal environment: providing nutrients, cleaning up intracellular particles, insulating axons, changing ion gradients, and connecting far-apart neurons (Kandel, et al. 2012). For simplicity, this work focuses on neurons as the main unit of biological neural networks, but aims to capture some of the behaviour of glia in the behaviour of abstracted biological neurons.
Like other cells, neurons have a cell body (the main part, which performs the functions that keep the cell alive) with a nucleus (where DNA, the genetic information is stored). Although neurons vary widely in their morphology, most have an axon (a long, insulated part for propagating signal over longer distances) and dendrites (branched protrusions) which meet the branches of other cells to form synapses – see Figure 1.
Figure 1: A cartoon biological neuron. Modified from (Health 2015)
This means that the connectivity of neurons in a biological neural network is heavily determined by the neuron’s size and structure, as well as their dendrites’ interaction with glial cells. As seen in Figure 2, neurons are often grouped by their morphology, but in reality these categorizations are only a guide, and in the brain we see a continuum of morphological characteristics.
Figure 2: The categorization of neurons based on axon/dendrite configuration is shown at left modified from (HowStuffWorks 2001), while a more realistic continuum of characteristics, and therefore, connectivity, is shown at right modified from (Stuffelbeam 2008)
Neurons communicate primarily in two ways; electrically and chemically. The cell membranes of neurons have proteins which can selectively transport ions.
The unequal distribution of negative and positive ions on either side of the membrane creates an electrical potential. When a neuron is electrically stimulated, the membrane of the axon releases ions, which causes nearby release of ions, and this release proceeds down the axon in a “wave” of depolarization. Some neurons are connected to others directly by membrane proteins called gap junctions, so that the wave of depolarization proceeds directly to those neurons as well, forming an electrical synapse. At the axon terminals, the ion release triggers small capsules of chemicals called neurotransmitters to be released into the synapse, where they may interact with receptors on the next neuron, which
could trigger something like ion transport which could in turn trigger a wave of depolarization in the next neuron, and so on. See Figure 3 for a summary of the processes involved at these chemical synapses.
Figure 3: Chemical transmission of an action potential. From (Cummings 2006)
Note that at gap junctions the electrical signal is propagated directly, with no
“computation” by the neuron, while the nature of neurotransmitters and receptors at the synapse allow for complex interactions between neurons. There are over a hundred neurotransmitters in the human brain, and they will have different effects depending on the receptor that receives them, so that the same neurotransmitter can have different effects at different synapses. Neurons may release more than one neurotransmitter, even at one synapse.
Neurotransmitters can also affect one another’s structure, binding action, or
reuptake rate; the availability and number of receptors at other neurons, and much more. This creates a system with enormous representational capacity and very complex dynamics. We only skim the surface, looking at broad functional consequences of this diversity. See (Kandel, et al. 2012) or (Purves, et al. 2001) for good introductory textbooks.
2.3.2 Biological neural networks
We begin this section by looking at a human-like nervous system and tracing its evolution, in an attempt to discover what it is about this system that seems to make for such unique behaviour; we find that the mechanisms for intelligent behaviour may not be as unique as they appear.
The nervous system consists of the brain and spinal cord (the central nervous system, or CNS), as well as all the neurons that go from there to various other parts of the body - motor neurons which send impulses to muscles causing them to move, and sensory neurons which are stimulated by touch/light/chemicals etc. and relay those stimuli to the brain. The brain has two connected hemispheres, and sensory information is generally ‘crossed’ – that is, sensory information from the left side of the body goes to the right side of the brain, and vice-versa, with an overlap in the middle. Although both sides process and combine similar information, in different species different hemispheres and regions can show dominance for certain tasks, for example, in humans, language comprehension tends to occur mostly in the left hemisphere (Kandel, et al. 2012).
This nervous system layout is a consequence of a number of evolutionary developments. In single-celled organisms, signalling proteins embedded in the cell membrane provide the cell with a mediated interaction with the external environment, and many of these proteins perform synapse-like functions, as well as being essential elements of animal synapses – for example, calcium transporters and protein kinases found in yeast and amoeba are vital to synaptic transmission in animals (Ryan and Grant 2009). The hypothesized evolution of the synapse is shown in Figure 4.
Figure 4: Hypothesized evolution of the modern synapse from (Ryan and Grant 2009)
In sponges (Porifera), a type of epithelial cell which sends electrical impulses helps to coordinate the movement of cilia to facilitate filter-feeding (Leys, Mackie and
Meech 1999). Some sponge-like animals may have lived their lives floating and moving freely, and it has been hypothesized that movement constitutes a selection pressure for cephalization, as it is more advantageous to have sensory structures at the end of the body which first encounters things to sense. In terms of detecting gradients (e.g. it is warmer over there, there is more food over here), it has been proposed that having the same sensory structures arranged bilaterally, perpendicular to the direction of movement, would provide an advantage. These two selection pressures together could have influenced the evolution of cephalization and bilaterianism (Peterson and Davidson 2000).
As the sensory cells specialized further, some worm-like creatures developed a hard structure which protected those sensory cells (called the notochord, basically a proto-spinal cord), and we call this group the vertebrates. Over time, in different groups, the nerve cord developed concentrations of nerve cells connected in small networks called ganglia, with longer nerve connections down the middle to connect to other tissues/organs in the body. This arrangement is seen in most animals today.
A species of paper wasp was recently shown to accurately and consistently learn wasp faces and associate them with individual recognition (Sheehan and Tibbetts 2011); Caenorhabditis elegans, a nematode worm, can learn and remember mechanical and chemical stimuli which predict aversive chemicals or the presence or absence of food (Ardiel and Rankin 2010); and Drosophila fruit flies and Aplysia sea slugs (a lophotrochozoan) are used to study many learning
behaviours because their nervous system structure (involving neurons and synapses) is extremely similar to our own. However, recent work has also demonstrated intelligent behaviours in living creatures without synapses or neurons: (Stock and Zhang 2013) review work on the Escherichia coli bacterial nanobrain, a collection of coiled sensory fibres which is capable of tracking long-term changes, encoding distinct memories, and associating different stimuli via methylation. They note that this methylation system is highly conserved in motile bacteria, but not found in sedentary bacteria, and suggest an important connection between sensory-motor information and the development of intelligence. Similarly, it was recently demonstrated that the Mimosa pudica plant (called ‘the sensitive plant’ because it closes its leaves in response to touch) can learn and maintain a memory for over a month (Gagliano, et al. 2014). Even more recently, researchers demonstrated electrical communication via potassium-ion waves in bacterial communities of a biofilm (Prindle, et al. 2015). In a similar vein, amoeboid slime molds have also been shown to anticipate periodic events by rhythmic timing (Saigusa, et al. 2008). These findings emphasize the importance of time, causality, and prediction to intelligent behaviour.
2.3.3 Biochemical systems capable of learning
Chemical kinetics are Turing-complete (Magnasco 1997), and are therefore capable of representing an arbitrary set of operations. Magnasco demonstrates that a flip-flop between a minimum of two states is necessary and sufficient for
encoding a chemical memory which, with appropriate energy input, is infinitely long-lived. Coupled with a ‘chemical repeater’ which copies input to output, he demonstrates the theoretical existence of the chemical prerequisites for communication and memory. (Fernando, et al. 2009) demonstrate that these prerequisites can be satisfied by phosphorylation of plasmids, specifically showing a circuit which implements Hebbian learning (see Section 2.3.5 Types of learning, below, for a description of Hebbian learning). It has also been suggested that the cyclic reactions of calcium could form the basis of these required bio-oscillators (Eccles 1983) (De La Fuente, et al. 2015).
These findings suggest that the mechanisms which underlie learning and memory are shared by most, perhaps all living creatures (possibly even abiotic chemical networks as well), but are costly to maintain, and in the vast majority of cases this energy cost exceeds the benefit of most intelligent behaviours. This brings us back to the relationship between intelligence and efficient coding; that what is special about neurons and synapses of biological networks is not the presence of the ability to learn and remember, but rather the efficiency with which information can be stored.
2.3.4 Biological learning and memory
Learning and memory are closely related. A neuroscientific symposium defined learning as a process for acquiring memory, and memory as a behavioural change caused by an experience (Okano, Hirano and Balaban 2000). Using this
definition, if a system learns, it is redundant to say that it remembers, so we will concern ourselves primarily with describing learning. The method of storing this change is generally accepted to be modification of the way neurons are connected at their synapses. This idea is called synaptic plasticity – the connections between neurons can be modified, and it is these modifications to the synapses which constitute learning. In particular, neuron A and neuron B can form more synapses with each other; at a given synapse, the amount of neurotransmitter that A releases can be changed by changing its reuptake rate of neurotransmitter from the synaptic cleft, or by increasing the number of synaptic vesicles available for releasing neurotransmitter; neuron B can change the number and type of receptors available for binding neurotransmitters; and so on.
Collectively, these properties which allow two neurons to have more or less of an effect on each other are called the synaptic strength, synaptic weight, or synaptic efficacy. If modifications to the synapse increase the synaptic strength between two neurons, this is called long-term potentiation, while a decrease in synaptic strength is called long-term depression.
The exact mechanisms of long-term potentiation and long-term depression are not fully understood. It appears that a rapid or sustained change in calcium concentration affects the behaviour of a calcium-dependent kinase, which can insert/remove neurotransmitter receptors and also change their activity level. At some point during this early phase, a calcium-independent kinase also becomes active, which can begin a cycle of protein synthesis and maintenance which is
thought to be the basis of long-term memory. Long-term depression is even less well-understood; it can be triggered by a prolonged lack of activity, or also by certain other patterns of activity. The precise molecular mechanisms of long-term potentiation and depression are a very active area of research. See (Serrano, Yao and Sacktor 2005), (Bliss, Collingridge and Morris 2004), and (Lynch 2004) for active research, and (Kandel, et al. 2012) for review.
2.3.5 Types of learning
Non-associative learning is a change in the strength or probability of a response to a single stimulus, as a result of repeated exposure to that stimulus, often as a reflex response. Habituation is non-associative learning where the response becomes less strong or less probable after repeated exposure, while in sensitization the response becomes stronger or more probable after repeated exposure. Generally when studying learning we are more interested in associative learning, wherein an association between two stimuli, or between a behaviour and a stimulus, is learned. Studies beginning in the 1960s, mostly on the mollusc Aplysia, showed that the mechanisms underlying simple habituation and sensitization were also responsible for more complex associative memory (Duerr and Quinn 1982) (Roberts and Glanzman 2003) (Kandel, et al. 2012). Describing the mechanisms of learning, however, does not necessarily tell us exactly why, or when, learning occurs. The most widely accepted theory explaining the cause of learning is Spike-Timing-Dependent Plasticity (STDP), based on Hebb’s postulate.
Hebbian learning is often paraphrased as “neurons that fire together, wire together”, but was stated by Hebb in 1949 as “when an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased”. This phrase covers long-term potentiation, and was amended by others in the 1970s to include long-term depression as well (Feldman 2013). In other words, if neuron A frequently fires just before neuron B, long-term potentiation takes place such that neuron B is more likely to fire when stimulated by neuron A in the future, while anti-Hebbian learning says long-term depression will occur if neuron A frequently fires just after neuron B. In some situations, the effect of the time interval is reversed: long-term potentiation is triggered if neuron A frequently fires just after neuron B, and long-term depression is induced if neuron A frequently fires just before neuron B (Choe 2014) (Letzkus, Kampa and Stuart 2006) (Fino, et al. 2010) (Bell, Caputi and Grant 1997).
These four types of learning can be encompassed by the idea of spike-timing dependent plasticity, which in its most general definition simply says that learning is dependent on spike-timing (Feldman 2013). However, in important academic works STDP is sometimes used to refer only to Hebbian learning, or only to Hebbian and anti-Hebbian learning (not including the time-interval-reversed situations); to add to the confusion, Hebbian learning is often taken to mean only long-term potentiation – see varying definitions in, for example, (Bi and Poo 2001) (Markram, Gerstner and Sjostrom 2012) (Feldman 2013) (Bengio, et al. 2015). And although temporally inverted effects are observed, they are not generally given a name.
To write this more precisely, suppose that during some time interval t_0 to t_T, presynaptic neuron A fires at times t_a[1,2,…,n] and postsynaptic neuron B fires at times t_b[1,2,…,n]. The summation of differences in timing can then be written as the integral of

Δt_i = t_a^i − t_b^i

over the n firing-pairs of neuron A and neuron B. The effects this can have in terms of learning, with terms commonly used in the literature, are summarized in Table 1.
Table 1: Summary of possible causes and effects in spike-timing dependent plasticity (STDP)

∫ Δt_i from t_0 to t_T | Meaning                                    | Consequence if long-term potentiation   | Consequence if long-term depression
≤ 0                    | Neuron A repeatedly fired first/same time  | Hebbian learning                        | Anti-Hebbian learning
> 0                    | Neuron B repeatedly fired first            | Temporally-inverted Hebbian learning    | Temporally-inverted anti-Hebbian learning
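As a toy illustration (the function and parameter names below are our own, not standard terminology), the regime naming in Table 1 can be sketched in code: sum the paired timing differences Δt_i = t_a^i − t_b^i, and let a flag choose between the potentiation (Hebbian) and depression (anti-Hebbian) columns.

```python
def stdp_regime(pre_times, post_times, potentiation=True):
    """Name the STDP regime for paired spike times (toy sketch of Table 1).

    pre_times[i] and post_times[i] are the i-th paired firing times of the
    presynaptic neuron A and postsynaptic neuron B. A non-positive sum of
    differences (A repeatedly fired first or simultaneously) gives the first
    row of Table 1; a positive sum (B repeatedly fired first) gives the
    temporally-inverted row. The flag selects the column: long-term
    potentiation (Hebbian) or long-term depression (anti-Hebbian).
    """
    total = sum(a - b for a, b in zip(pre_times, post_times))
    label = "Hebbian" if potentiation else "anti-Hebbian"
    if total > 0:
        label = "temporally-inverted " + label
    return label
```

For example, `stdp_regime([0.0, 1.0], [0.5, 1.5])` returns "Hebbian", since neuron A fired 0.5 time units before neuron B in each pair.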
Spike-timing dependent plasticity likely does not represent the full picture of learning in biological synapses (Bi and Poo 2001). Learning can also be non-local, mediated by glial cells or by neurotransmitters such as nitric oxide (NO), which quickly diffuse to neighbouring synapses. (Letzkus, Kampa and Stuart 2006) found that temporal inversion of STDP depended on the relative location of synapses within the dendritic arbor. Long-term depression can be induced by non-repeated excitatory impulses, as well as at inactive synapses when calcium
concentration is raised by glia or nearby neural activity, and can also occur at electrical synapses (Haas, Zavala and Landisman 2011). And although we normally view the synapse as ‘belonging’ to neuron B (the post-synaptic neuron),
(Costa, et al. 2015) model STDP with decay-trace spike trains and compare to an in-vivo model, finding that plasticity is more accurately portrayed by employing both pre-synaptic factor modification and post-synaptic factor modification.
2.4 Artificial neural networks
2.4.1 Artificial Intelligence
Much of what we have just presented about learning systems was not known in the 1930s and 1940s, when the field of artificial intelligence grew out of an increased understanding of psychology, statistics, physics, probability, and logic.
The goal of artificial intelligence research is to understand and simulate human- like intelligent behaviour (Russell and Norvig 2003).
Human thought and formal reasoning have been examined since at least 1000
BC by Chinese, Indian, and Greek philosophers. These studies grew into the fields of philosophy and mathematics, and the duplication of intelligent behaviours by machines was suggested as early as the 1200s by Llull, whose ideas were developed mathematically by Leibniz in the 1600s. Through this time, the physiology of the nervous system was increasingly well understood, from Galen’s research on injured gladiators around 150 AD, amended with precise
experimentation by Vesalius in the 1500s. By the late 1800s, the cartoon neuron described in the first section of this work was almost elucidated (see work by
Gerlach, Purkinje, Valentin, Remak, Schwann, Schleiden, Golgi, and Ramón y Cajal), while the Industrial Revolution motivated studies of mechanical physics and conservation of energy (see Carnot, von Mayer, Joule, Helmholtz, Clausius, Kelvin, and others), and Laplace’s rediscovery of Bayes’ work laid the foundations of probability theory. The formalization of logic and development of calculating machines in work by Boole, Frege, Babbage, Lovelace, Russell and Whitehead, and Hilbert culminated in the 1930s with work by Gödel, Turing, and Church, demonstrating that a mechanical device could imitate any conceivable mathematical process. Through the 1940s, it was additionally demonstrated by
Turing and von Neumann that such a mechanical device could be constructed and used to solve problems, such as breaking cryptographic codes. The creation of these computers, combined with an increasing foundation of psychological and neurological research in physiology, and applications of the laws of thermodynamics to information theory (Shannon, Minsky, and others) contributed to interest in creation of an electro-mechanical artificial brain. In
1956 the Dartmouth Summer Research Project on Artificial Intelligence was held, widely considered to be the founding of the field as an academic discipline.
For histories of artificial intelligence research, see (Buchanan 2006) (Russell and
Norvig 2003).
Today, ‘artificial intelligence’ is addressed through research in a number of fields including knowledge discovery, data mining, robotics, economics, statistics, neuroscience, and many others.
2.4.2 Machine learning
Machine learning seeks to design algorithms which change in response to data.
This allows the algorithms to act in ways that have not been explicitly programmed. Machine learning is often used almost as a modern synonym for artificial intelligence, but tends to focus more on the computational
(mathematical and software) aspects than robotics, logic, psychology, or biology.
Machine learning methods have proved successful in a variety of domains, particularly in classification and prediction – anywhere the underlying statistics of patterns in data can be recognized and used.
Supervised learning methods require a dataset which contains “correct answers”
– for example, a medical dataset of blood pressure, weight, height, age, and other factors for 10 000 people for whom the diagnosis of a disease is known. The algorithm will find patterns in this data so that it could determine that people with high blood pressure, above a certain age, and with a certain genetic factor, tend to be diagnosed with the disease (a clustering problem); or if presented with a new case, it could say based on its previous information whether or not the person would have the disease (a classification problem). Semi-supervised learning incorporates some examples for which the “correct answer” is not
known, usually in situations where there are only a small set of such labelled examples, and uses some kind of similarity metric to relate unlabelled examples to labelled ones. In unsupervised learning, where no labels are known, the algorithm simply searches for structure and patterns in the data. Reinforcement learning is like an active form of semi-supervised or unsupervised learning; there may or may not be a period of training on known examples at the beginning, and then new examples keep coming in, and the algorithm changes with the new information.
Note that machine learning does not equate to just neural networks – many other tools and methods exist. See (Bishop 2006) or Andrew Ng’s Coursera course (Ng 2015) for a good introduction. However, artificial neural networks provide the closest parallel to biological neural networks, and this is the focus of this work.
2.4.3 Artificial neurons
The classic artificial neuron is the McCulloch-Pitts neuron, wherein all inputs are binary (0 or 1), all connections have the same fixed weight, and neurons calculate their output as a weighted sum and ‘fire forward’ according to the Heaviside step function (McCulloch and Pitts 1943):
x_j(t+1) = H[ Σ_i w_ij x_i(t) − θ_j ]

where:
H is the Heaviside step function, H(x) = 0 for x < 0 and 1 for x > 0
w_ij is the weight from unit i to unit j
x_i(t) is the i-th unit’s output at time t
θ_j is the threshold of unit j
Based on studies of fly visual systems, psychologist Frank Rosenblatt expanded on this model to make the weights real-valued, and the perceptron was born.
Networks of perceptrons, as shown in Figure 5, are binary classifiers; that is, they learn a linear separation of the input data.
Figure 5: A ‘network’ of two input perceptrons, x_1 and x_2, and one output, y, can perform a linear separation between 0s and 1s (a simple logic gate such as AND) if it has learned appropriate weights
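To make the figure’s idea concrete, here is a minimal sketch of a two-input perceptron computing a linearly separable function (AND). The weights and bias are chosen by hand for illustration rather than learned.

```python
def heaviside(x):
    # Step activation: output 1 when the input reaches 0, otherwise 0
    return 1 if x >= 0 else 0

def perceptron(inputs, weights, bias):
    # Weighted sum of the inputs minus a threshold, passed through the step function
    return heaviside(sum(w * x for w, x in zip(weights, inputs)) - bias)

def and_gate(x1, x2):
    # Hand-chosen weights implement AND: the sum exceeds 1.5 only when both inputs are 1
    return perceptron([x1, x2], weights=[1.0, 1.0], bias=1.5)
```

Any line w1*x1 + w2*x2 = bias that separates the point (1, 1) from the other three input points would work equally well; learning amounts to finding such a line.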
Generalizing the McCulloch-Pitts neuron above, we have:

y_j = φ[ Σ_i w_ij x_i − b_j ]

where:
φ is the activation function
w_ij is the weight from unit i to unit j
y_j is the j-th unit’s output at time t
x_i is the i-th unit’s output (from the previous layer) at time t
b_j is the bias of unit j
If we consider the activations of an entire layer, we can also write the much more compact vector form:

a_l = φ(W_l a_{l−1} + b_l)

where:
φ is the activation function
W_l is the matrix of weights incoming to layer l
a_l is the vector of activations (outputs) for layer l
a_{l−1} is the vector of activations (inputs) for layer l
b_l is the bias vector for layer l
This is the basic neuron of most modern artificial neural networks.
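The vector form translates almost directly into NumPy. In this sketch the layer sizes and random values are arbitrary illustrations, and the logistic sigmoid stands in for the activation function φ:

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid activation, applied element-wise; outputs lie in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, a_prev, b):
    # a_l = phi(W_l a_{l-1} + b_l): one layer's activations from the previous layer's
    return sigmoid(W @ a_prev + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # weight matrix: 4 inputs -> 3 units
a_prev = rng.standard_normal(4)   # activations from the previous layer
b = np.zeros(3)                   # bias vector
a = layer_forward(W, a_prev, b)   # shape (3,), one activation per unit
```

Stacking calls to `layer_forward`, one per layer, gives the forward pass of a whole network.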
The terms “activation function”, “transfer function”, and “output function” can be a source of confusion between disciplines. In computational neuroscience literature, the term activation function refers to how the neuron calculates its own state, or internal activation value, by taking account of influences from other neurons. In artificial neural networks, this function is almost universally the sum of inputs each weighted by a connection strength, and this weighted sum is not given a particular name. In computational neuroscience, the output function is applied to the activation value to determine the output value that that neuron will send to its connections. In artificial neural networks, this is the function referred to as the activation function, ‘non-linearity’, or ‘non-linear function’, referring to its role in function approximation, and the output of the neuron is variously called the output, activity, activity level, or activation. In computational neuroscience, the activation and output functions taken together are called the transfer function. In artificial neural networks, this term is
sometimes used interchangeably with activation function (again, to refer to what a neuroscientist would call an output function), while the term ‘output function’ is rarely used.
This work is primarily concerned with documenting biological influence in artificial neural networks, and does not attempt to be a reference textbook for terminology, so we will somewhat reluctantly follow the very entrenched terminology of artificial networks, referring to the activation function as the function which determines a neuron’s output. Popular choices for the activation function include the logistic sigmoid, hyperbolic tangent, and rectified-linear, whose relative benefits are discussed in the next section. There is occasionally also some naming confusion between the logistic sigmoid and hyperbolic tangent
– both curves are sigmoidal, however the logistic sigmoid is sometimes referred to simply as the ‘sigmoid’.
2.4.4 Artificial neural networks
Particularly when designed for classification or prediction, artificial neural networks are generally arranged in layers. The input layer is the data or
‘stimulus’ to be presented to the network, and the output is the transformed data or ‘response’ to the data, while the hidden layers form the network proper, performing the transformation of the input data via the weight and bias matrices and action of the activation function.
On the ‘forward pass’ through the network, each input node calculates its activation based on the data presented (e.g. the pixel colour in an image), and passes this number on to all of its connections in the next layer (typically, layers are fully connected). Neurons in the first hidden layer calculate the weighted sum of their inputs, subtract a bias term, and pass this number through an activation function to determine their outputs. This process proceeds through each hidden layer until the output layer, where the outputs are interpreted relative to a cost function or target value – for example, for a network attempting to classify the digits 0-9, as shown in Figure 6 we might have 10 outputs, and designate each to be one of the numbers. So if the image presented was a picture of the number 9, the cost function would describe
[0,0,0,0,0,0,0,0,0,1] as being the ideal output.
Figure 6: An artificial neural network with 1 hidden layer set up to classify the digits 0-9 in a 784-pixel image
Learning then comes down to credit assignment – how can we take this ideal value as compared to the output of the network, and change the weights in the hidden layers to make the network more likely to have a response closer to the target in the future? In most feed-forward networks this question is answered by backpropagation – the output nodes calculate how far they were from the target, measured as some kind of error or distance. In order to reduce that error next time, they look at whether they should increase or decrease their output. This direction of increase/decrease is called the gradient, which ‘points’ (is larger) in the direction of increasing error. The only method the unit has of changing its output is to change the weights of connections coming in to it – to change how
much it ‘listens’ to various neurons in the previous layer. To figure out how to change, the neuron makes each of its incoming weights move in the direction of the gradient – so if the neuron needed to output a lower number next time, it increases the weight with neurons that sent negative numbers, and decreases the weight with neurons that sent positive numbers. The exact amount of this update is −η ∂E/∂w_ij: the partial derivative of the error with respect to the weight, multiplied by a small learning rate η (so that the weights don’t change wildly with each new piece of information), subtracted from the previous weight to move in the direction of steepest minimization of error.
There are many variations of gradient-based backpropagation, but virtually all performant methods for training artificial neural networks are based on some kind of gradient descent, as described above. Variations include: adding momentum (a weighted average of the current and previous weight updates); or in unsupervised settings, using contrastive divergence (which minimizes a particular kind of distance between new examples and an equilibrium distribution for visible variables). See (Rojas 1996) (Y. LeCun, et al. 1998)
(Nielsen 2015) for a thorough explanation of backpropagation and variations.
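A minimal numerical sketch of the update rule described above, for a single weight of a linear unit with squared error on one training example (the learning rate and data are illustrative choices, not from the text):

```python
def sgd_step(w, x, target, lr=0.1):
    """One gradient-descent step for a linear unit y = w*x with squared error.

    With E = 0.5 * (y - target)**2, the gradient is dE/dw = (y - target) * x,
    and the update w <- w - lr * dE/dw moves w in the direction of steepest
    decrease of the error.
    """
    y = w * x
    grad = (y - target) * x
    return w - lr * grad

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=1.0, target=2.0)
# w converges toward 2.0, the weight that drives the error to zero
```

The small learning rate makes each step a modest correction; repeated steps, one per example, trace out the descent toward a minimum of the error.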
2.4.5 Deep learning
In deep learning, back-propagation follows exactly the same steps as outlined above, however the network structure generally contains more hidden layers – i.e. is deeper. This arrangement means that higher layers form patterns of
patterns; transformations occur not only on the input data but on the internal representation of the input data, to form hierarchies of representation. Deep networks are not a recent innovation in concept, but technological advances in the last 5-10 years, particularly relatively low-cost GPUs, have made them orders of magnitude faster to train, and commercial successes with companies like Google, Facebook, Apple, Microsoft, and Bing have made large multi-layer neural networks one of the most popular areas of machine learning research. See
(LeCun, Bengio and Hinton 2015) and (Schmidhuber 2015) for thorough reviews of deep learning.
2.4.6 Deep belief networks
A Boltzmann machine (Ackley, Hinton and Sejnowski 1985) is a stochastic recurrent neural network of binary units, wherein each unit takes the ‘state’ of being either 0 or 1 with a certain probability based on the energy function of the network. Weights between neurons are symmetric (there is no concept of layers), with a positive weight indicating that the ‘hypotheses’ of the units support one another, while a negative weight suggests the two hypotheses should not both be accepted. Weights are adjusted by seeing which neurons are in the same state at the same time and strengthening those connections, as well as sometimes decreasing the weights of simultaneously active neurons so that weights don’t keep increasing until all units become saturated.
Restricted Boltzmann machines constrain a Boltzmann machine’s connections to form a bipartite graph. Stacking these machines to create a layer structure forms a Deep Belief Network (DBN), introduced by (Hinton and Salakhutdinov 2006). DBNs were one of the first deep architectures to be of practical significance, due to a greedy, layer-by-layer unsupervised training method developed by (Hinton, Osindero and Teh 2006).
2.4.7 Convolutional neural networks (convnets)
A convolution multiplies patches of the input by a particular matrix, often called a filter or kernel, producing an output called a feature map. For example, if you applied a vertical-line-detection filter to a matrix of pixel values of an image of the number 1, the output of that filter would likely have most of its central values activated. Filters can also perform translations, rotations, and other transformations. Convolutional neural networks usually alternate layers of filter-banks with pooling or other layers, to create a hierarchy of representations, and learn weights not only for normal layers but also weights within the filter (i.e. learning the filter/transformation itself). Convnets are based on biological research in the visual cortex of cats and monkeys (Hubel and Wiesel 1968). With recent architectures achieving near-human performance or better on many benchmark datasets, they are something of a poster child for successful deep neural networks.
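The filtering idea can be sketched as a plain ‘valid’ 2-D convolution (our own minimal implementation, not a library call), using a hand-built vertical-line filter on a tiny image:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over every 'valid' position of the image and take
    # the element-wise product-and-sum at each location
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical line of ones down the middle of a 5x5 image
img = np.zeros((5, 5))
img[:, 2] = 1.0

# Hand-built vertical-line filter: responds strongly to a bright vertical stroke
vline = np.array([[-1.0, 2.0, -1.0]] * 3)

fmap = conv2d(img, vline)  # the feature map peaks where the line sits
```

In a convnet the kernel entries would be learned by backpropagation rather than set by hand, and many kernels run in parallel to form a bank of feature maps.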
2.4.8 Rectified linear units
The rectified-linear activation function is f(x) = max(0, x); in other words, if the weighted sum is positive, output that weighted sum, otherwise output 0. In 2010, Nair and Hinton introduced rectified linear units to deep Boltzmann machines, and showed that despite being unbounded and not differentiable at 0, they improved results (Nair and Hinton 2010). Subsequently, (Glorot, Bordes and Bengio 2011) demonstrated their usefulness in dramatically speeding up training in very large, deep networks, and (Krizhevsky, Sutskever and Hinton 2012) used them in a convolutional neural network to reduce state-of-the-art image recognition error on the ImageNet dataset by more than half. ReLUs are now the unit of choice for deep architectures (LeCun, Bengio and Hinton 2015).
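As a sketch, the rectifier and one conventional choice of its subgradient at 0 (where the function is not differentiable) can be written as:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Subgradient of ReLU. The value at exactly 0 is a convention
    (here 0), since the function has a kink there."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu(x)       # [0., 0., 0., 0.5, 2.]
g = relu_grad(x)  # [0., 0., 0., 1., 1.]
```

The constant gradient of 1 for all positive inputs is part of why ReLUs train quickly: unlike sigmoid or tanh units, they do not saturate on the positive side.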
2.4.9 Dropout and DropConnect
Dropout was introduced by (Hinton, Srivastava, et al. 2014) to address a common problem in deep neural networks: overfitting. As the name suggests, dropout simply sets the outputs of a random percentage of units to zero, with the aim of preventing feature co-adaptation. Because a different random subnetwork is sampled on each training pass, training effectively averages what the output would have been if many sparser networks had been run on the input. DropConnect, developed by (Wan, Zeiler, et al. 2013), is very similar to dropout, but sets weights equal to zero instead of activations. Both of these methods have proven successful for preventing overfitting, although on real-world data which is sparse and noisy, (McMahan, et al. 2013) found dropout to negatively impact representations by removing training-data information.
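A minimal sketch of a dropout layer follows. We use the now-common ‘inverted’ variant, which rescales surviving activations by 1/(1-p) during training so no rescaling is needed at test time (the original formulation instead scaled the weights at test time); the layer size and drop rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop, training=True):
    """'Inverted' dropout: zero a random fraction p_drop of units and
    rescale the survivors by 1/(1 - p_drop), so the expected activation
    is unchanged between training and test time."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones(10000)
out = dropout(a, p_drop=0.5)
# Roughly half the units are zeroed; the survivors are scaled to 2.0,
# so the mean activation stays close to 1.0.
```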
2.4.10 Maxout
Maxout activation functions simply pass on the maximum weighted input they receive, without summation. Maxout networks were introduced by (Goodfellow, et al. 2013) to work in concert with dropout. The maximum weighted input is calculated before dropping out, so active units are never dropped, making the network effectively average only the best-performing models. The authors demonstrate that maxout units can act as rectifiers, absolute-value, or approximately quadratic functions, depending on their inputs, as shown in
Figure 7 below.
Figure 7: From (Goodfellow, et al. 2013), demonstrating how a maxout unit can act as a rectifier or absolute-value function, or approximate a quadratic function. h_i(x) is the hidden unit activation function for input x. The authors note this is only a 2-D case for illustration; the concept extends equally well to arbitrary convex functions.
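A minimal sketch of a maxout unit, a max over k linear functions of the input, showing with hand-picked weights of our own how it reduces to the absolute-value and rectifier cases from Figure 7:

```python
import numpy as np

def maxout(x, W, b):
    """A maxout unit: the maximum over k linear functions of input x.
    W has shape (k, d), b has shape (k,)."""
    return np.max(W @ x + b, axis=0)

# With pieces w = +1 and w = -1 (zero bias), a maxout unit over a scalar
# input computes the absolute-value function ...
W_abs = np.array([[1.0], [-1.0]])
b_abs = np.zeros(2)

# ... and with pieces w = 1 and w = 0 it computes the rectifier.
W_relu = np.array([[1.0], [0.0]])
b_relu = np.zeros(2)
```

Adding more pieces with suitable slopes and biases lets the unit approximate any convex function, which is the generalization the authors point to.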
Chapter 3:
Comparing biological and artificial neural networks
3.1 Introduction
The ubiquity of powerful personal computers has led to an accumulation of truly massive amounts of data about users, connections, news, and other information, as well as widely available means of analyzing this data. Large companies such as Google, Facebook, Amazon, Apple, and Microsoft are in a financial position to offer substantial rewards for innovative research using this data in some way.
Among other factors, this has contributed to a highly productive, empirical- results-driven landscape for artificial neural network research (and machine learning more generally). With so much concrete evidence of their success, it is a good time to revisit some of the intuition motivating artificial neural networks, and relate their performance to modern research in biological neural networks.
We hope that these insights will provide a context for practitioners to connect empirical results with underlying concepts in modern work, and also to re-examine the rich neural network literature of the past 30 years.
3.1.1 Traditional intuition
Although the literature on artificial neural networks tends to shy away from direct comparisons with biological neural networks, it is often stated in an introductory
fashion that the weights of an artificial neural network represent the synapses of a biological neural network, with the sign indicating whether that synapse is excitatory or inhibitory; the bias represents the excitability; and that the activation of an artificial neuron represents the firing rate (Karpathy 2015)
(Nielsen 2015) (Rojas 1996). Recent work has also suggested parallels between spike-timing-dependent plasticity (STDP) and gradient-descent backpropagation
(Bengio, et al. 2015). Most authors emphasize that this comparison makes broad generalizations and is not to be taken literally, but the intuition is helpful in many ways. This work considers this intuition critically, and suggests some modifications.
3.1.2 Inhibition and excitation
Most of the neuron experiments which have motivated artificial neuron models have been performed on excitatory neurons. This is because motor and sensory neurons are much more exposed and easy to study than neurons of the brain, and vertebrate motor and sensory neurons are inherently excitatory. It is interesting to note that this is not true of invertebrates, wherein the effect of the impulse is determined by the type of neurotransmitter released and the receptor it binds (Kandel, et al. 2012). Thinking of neurons as primarily excitatory informs the idea of neural networks in general as being constructive rather than discriminative. This is a subtle but not unimportant distinction. With the success of engineered features in computer vision tasks, we generally think of neural networks as doing the same thing – constructing and combining patterns
from input information to arrive at a solution. But the nature of biological neural networks suggests otherwise. The central nervous system receives a constant stream of excitatory information from many sources, and most of it is irrelevant at any given time. Studies in selective visual attention suggest that we cut down visual information to a minimum, combine the most relevant parts through disinhibition (inhibition of inhibitory connections), and fill in most of the details with imagined or assumed information (Hartline, Wagner and Ratliff 1956)
(Handy, Jha and Mangun 1999) (Wilson 2003) (Kandel, et al. 2012).
Inhibition is important for learning and memory. As (Hulbert and Anderson 2008) suggest, consider the problem of remembering where you parked your car, remembering a friend’s new address, or speaking a language you have recently learned. The memories of all the cars you have ever parked, all of your friend’s previous addresses, or how to say something in another language compete with the new memory; a body of experimental evidence demonstrates that the greater the number of memory competitors, the more difficult the retrieval of a specific memory becomes. With excitatory-only models of memory, the relatively weak stimulus of a new parking spot, address, or word would need to be repeated a number of times exceeding the reinforcement of the old memory, but studies indicate that old or conflicting memories are not just supplanted, but actually suppressed (Jonides, et al. 1998). (Johansson, et al. 2006) recorded event-related potentials from participants engaging in a memory retrieval task, finding that while excitatory potentials dominated during the activity, inhibition
dominated afterwards, and ability to retrieve that memory later was proportional to the amount of inhibitory activity. (Hulbert and Anderson 2008) interpret this and other similar experiments to indicate that inhibition ‘prunes’ new information to fit it in with pre-existing memories, and thus store new memories more efficiently. Inhibition also plays an important role in sleep and memory formation (Siegel 2001) (Rasch and Born 2013).
The abstract of McCulloch and Pitts’ (1943) seminal paper begins: ‘Because of the “all-or-none” character of nervous activity, neural events and the relations among them can be treated by means of propositional logic’. Today, no neurologist would describe nervous activity as “all or none”, and networks of binary threshold neurons are not empirically useful for solving most problems. (Rojas 1996) presents a proof that networks of McCulloch-Pitts-type neurons with no inhibition can only implement monotonic logical functions; that is, negative weights are required to implement functions involving the ‘NOT’ operator. This is well known, and virtually all artificial neural network implementations allow negative weights.
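A small sketch makes the monotonicity point concrete: with only non-negative weights, turning an input on can never turn the output off, so a monotonic function like AND is implementable while NOT requires a negative weight (the particular weights and thresholds are our own choices):

```python
def mp_neuron(inputs, weights, threshold):
    """A McCulloch-Pitts threshold unit: fires (outputs 1) if and only if
    the weighted sum of its binary inputs reaches the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

# AND is monotonic, so non-negative weights suffice.
def and_gate(x, y):
    return mp_neuron([x, y], weights=[1, 1], threshold=2)

# NOT is non-monotonic: the output must fall when the input rises.
# A single negative (inhibitory) weight implements it.
def not_gate(x):
    return mp_neuron([x], weights=[-1], threshold=0)
```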
One of the first studies of inhibitory influence in biological neural networks was
(Hartline, Wagner and Ratliff 1956), demonstrating the importance of lateral competitive inhibition in the retina of the horseshoe crab. This finding hints at the importance of inhibition in complex networks, and in vision in particular.
Despite this early result, most artificial neural network texts mention inhibition only in passing, to say that it is represented by negative weights. However, this
does not represent the full picture of effects that negative values can have in neural networks. We suggest that their primary importance is to give the possibility of equal weight to ‘information for a hypothesis’ and ‘information against a hypothesis’.
3.1.3 Accidental bias
Let us trace negative and positive values in a neural network in more detail. In the forward pass through an artificial neural network, if inputs are normalized to a mean of 0 and weights are initialized randomly, the output of the first layer can be positive or negative with equal probability. An arbitrary unit in the next layer then sums the positive and negative numbers arriving from each of the units in the first layer; that is, we can describe the weight as the effect neuron A has on neuron B’s summation. Neuron B then calculates its output value; if it employs an activation function which can take negative values, this number will be positive if the majority of the input was excitatory, or negative if the majority was inhibitory. Incoming to the next layer, this value can then be multiplied by either a positive or negative number, depending on the weight of the connection; if negative, the ‘intended’ effect of the output of the neuron in the previous layer is reversed. At the output layer, values are compared to the target, and the distance/cost/error of the output is calculated. From this value, at each neuron the gradient with respect to the output is calculated, pointing in the direction of greatest error, and multiplied by the derivative of the activation function (evaluated at that unit’s input) to obtain the gradient
with respect to each weight. If the activation function is always increasing, its derivative is always positive, so the sign of the update depends only on the sign of the gradient.
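To make the walk-through concrete, here is a minimal numeric trace of how signs propagate; the tanh activation, the hand-picked weights, and the squared-error target of 0 are illustrative assumptions of ours, not specifics from the text:

```python
import numpy as np

# Forward pass: zero-mean input, random-sign weights.
x = np.array([0.5, -0.5])
W1 = np.array([[ 1.0, -1.0],
               [-1.0, -1.0]])
h_pre = W1 @ x          # net inputs: [1.0, 0.0]
h = np.tanh(h_pre)      # positive iff the net input is positive

# A negative second-layer weight reverses the 'intended' effect of unit 0.
W2 = np.array([[-1.0, 1.0]])
y = W2 @ h              # output is negative, although h[0] is positive

# Backward pass: for squared error against a target of 0, the error
# signal at the output is (y - target).
grad_h = W2.T * (y - 0.0)     # gradient w.r.t. hidden activations
dtanh = 1.0 - h ** 2          # tanh', strictly positive everywhere

# Because tanh is always increasing (dtanh > 0), multiplying by it never
# flips a sign: the sign of each update follows the sign of the gradient.
```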
It may seem pedantic to go through these operations by hand, but they have implications for how the network functions which are not often considered. At each step – input layer, first weights, first hidden layer units, second layer weights, second layer units (etc. for all hidden layers), output units, weight update – information changes and the network has an opportunity to react to this information. Squashing the range of any of this information as compared to the other information in the network will make the network dynamics unstable and slow, as the network attempts to learn this artificial (and accidental) bias.
The implications of these accidental biases for different networks are the topic of section 3.5 Application of proposed biological intuition.
3.2 Proposed biological intuition for artificial neurons
Motivated by the discussion above, the core intuition of these proposed changes is to consider negative values in neural networks not as a lack of positive information, but as the presence of negative information. Table 2 summarizes current thinking about the activation, weight, partial derivative of cost with respect to activation, and derivative of activation function at a given weight, including what it means for them to be negative. Table 3 summarizes the suggested modifications to the classical intuition.
Table 2: Conventional view of interpretations of neural network properties

Property      Magnitude           Sign (+)                Sign (-)
Activation    Firing rate         Firing rate             n/a
Weight        Synaptic strength   Excitatory connection   Inhibitory connection