
How did cats inspire the birth of artificial neural networks?


The Intellectual



Image source: pixabay

By Zhang Tianrong

The year 2012 was a turning point for neural networks: an epoch-making model, AlexNet, was born. In the image-recognition contest ImageNet, it won the championship by an overwhelming margin, with a recognition rate 10.9 percentage points ahead of the runner-up, causing a sensation in the field of artificial intelligence. The secret of its three authors' success was the use of a "multi-layer convolutional artificial neural network". Most of the words in that phrase can be understood roughly at face value, but what does "convolution" mean? That is today's topic.

Inspiration from a Cat's Vision

The story goes back to the early 1960s, when David Hubel and Torsten Wiesel, two neurobiologists at Harvard, conducted an interesting experiment on cats, as shown in Figure 1. They used a slide projector to show a cat specific patterns and recorded the electrical activity of individual neurons in the cat's brain [1]. They found that specific patterns stimulated activity in specific parts of the brain. For their outstanding contributions to visual information processing, they shared the 1981 Nobel Prize in Physiology or Medicine.



Figure 1: The Harvard researchers' 1962 neurobiological experiment on cats

The Harvard scholars' experiments showed that the visual cortex responds to visual features through different kinds of cells: simple cells respond to light information, while complex cells respond to motion information. Around 1980, the Japanese scientist Kunihiko Fukushima, inspired by the cat experiments, simulated the biological visual system and proposed a hierarchical, multi-layer artificial neural network, the "Neocognitron", which is the predecessor of today's convolutional neural networks. In that work, Fukushima proposed a network structure containing convolution layers and pooling layers.

Fukushima came from a poor family, but his curiosity made him passionate about electronics. He later earned a doctorate in electrical engineering from Kyoto University. In 1965 he joined a research group on visual and auditory information processing to study the biological brain, and later worked with neurophysiologists and psychologists to build artificial neural networks.

In 1979, the "Neocognitron" came out. Its inspiration came from two kinds of nerve cells known to exist in the biological primary visual cortex: simple "S" cells and complex "C" cells, which later evolved into the convolution layer and pooling layer of today's neural networks, as shown in Figure 2 [2].

Fukushima is 88 years old this year; as recently as five years ago he published a research paper on neural networks.



Figure 2: Fukushima's Neocognitron (1980)

In fact, Fukushima's Neocognitron of 40 years ago already had the basic configuration of a convolutional neural network. At that time, however, the neurons of the network were all designed by hand; they could not adjust themselves automatically according to the results, so the learning ability was weak. The network was therefore limited to recognizing a few simple digits.

Coming early matters less than coming at just the right time: convolution was truly put into practice only in 1998, after the French computer scientist Yann LeCun (b. 1960) applied backpropagation to the training of convolutional neural networks.

Yann LeCun was born near Paris, France. In 1983 he received an engineering degree from ESIEE Paris, and in 1987 a doctorate in computer science from Pierre and Marie Curie University (Paris VI). He then did postdoctoral work at the University of Toronto under Geoffrey Hinton, a winner of the 2018 Turing Award.



Figure 3: Yann LeCun

In 1986, while working on his doctorate, LeCun set aside another line of research and began to focus on backpropagation. Inspired by the work of Hubel, Wiesel and Kunihiko Fukushima, and fascinated by studies of the mammalian visual cortex, he envisioned a multi-layer network architecture that alternated layers corresponding to simple cells and complex cells and could be trained with backpropagation. He believed that this type of network was very well suited to image recognition [3].

In 1988, LeCun joined Bell Laboratories in New Jersey. There he developed several machine learning methods, including convolutional neural networks [4], and actually built a working one. Bell Labs named it LeNet, after his surname LeCun; it was the first of the convolutional networks.

How does the human eye recognize objects?

Image recognition has always been a hot topic in artificial intelligence research, and for good reason. Human knowledge originally comes from observing the world, starting with the human eye and extending to telescopes, microscopes and other instruments; our great edifice of science is built on a wealth of observational data.

For a computer to imitate human capabilities and ideas, it must also imitate the process of human visual recognition. The eye is a very complex and delicate organ; together with its connections and feedback to the brain, the biological vision system is an advanced product of millions of years of evolution. Humans do not yet fully understand it, so it is of course not easy to imitate.

How does the human eye work? You may think this is very simple: the eye is an optical system; light reflected from an object is refracted by the lens and imaged on the retina, then transmitted by the optic nerve to the brain, and so we see the object. In the early days of AI, vision was simulated in exactly this way: a receiving device scanned the entire image into pixels and fed them to a neural network for recognition, as shown in Figure 4a.

However, human visual recognition is not that simple. How does the human eye recognize various patterns? More specifically, how does it recognize a handwritten letter "X"?



Figure 4: Machine recognition and human eye recognition

From experience we know that the human eye can see "at a glance" that there is an X in every small image in Figure 4b, no matter where the X is placed, how large or small it is, whether it is red or blue, or whether there is a background picture.

Scientists hope that machines can do this as well as possible, and that is where the magic weapon of "convolution" comes in.

What is convolution?

In fact, the concept of convolution appeared much earlier than neural networks. As the mathematical expression at the top of Figure 4 shows, convolution is an operation that produces a third function g(r) by multiplying two functions f(r') and h(r - r') and then integrating over r':

g(r) = ∫ f(r') h(r - r') dr'

Although it went by different names, operations similar to convolution first appeared in a mathematical derivation by d'Alembert in 1754 and were then used by other mathematicians; the term itself was formally introduced only in 1902.

Later, in communication engineering, convolution was used to describe the relationship between signals and systems. For any input f(t), the output g(t) of a linear system is the convolution of the input with the system's impulse response function h(t). For example, when a singer performs with a microphone, the sound heard through the microphone differs from the sound wave entering it, because the microphone delays and attenuates the input signal. If the microphone is approximated as a linear system whose effect on the signal is described by a function h(t), then the output g(t) of the microphone is the convolution of the input f(t) with h(t). Another interesting fact: if the input to the microphone is a Dirac δ-function, the output is exactly the impulse response function h(t).

Look carefully at the integral expression for convolution and you will notice that the integration variable r' enters the h function with a negative sign. If r is time t, this means the h function is "rolled" (flipped back toward past times), multiplied by the current value of f, and these products are then stacked up (integrated) to give the convolution. This is easy to understand in the microphone example: the output of the microphone at each moment depends not only on the current input but also on past inputs.



Figure 5: Convolution

To sum up the above, convolution can be understood more briefly as a weighted superposition of the function f, with h acting as the weighting function.
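This "flip, multiply and sum" can be tried directly with the discrete version of convolution. The signal and the impulse response below are invented toy numbers, not real microphone data; the point is only that NumPy's np.convolve implements exactly the operation just described.

```python
import numpy as np

# Toy input signal f: a short burst (hypothetical values).
f = np.array([0.0, 1.0, 2.0, 1.0, 0.0])

# Toy impulse response h: the "microphone" passes the signal with a delay and decay.
h = np.array([0.5, 0.3, 0.1])

# Discrete convolution: g[n] = sum_k f[k] * h[n - k]  (h is flipped and shifted).
g = np.convolve(f, h)
print(g)

# Feeding a discrete "delta" (a single 1) through the system returns h itself,
# as noted in the microphone discussion above.
print(np.convolve(np.array([1.0]), h))   # -> [0.5 0.3 0.1]
```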

The beauty of mathematics lies in abstraction: an abstract concept can be applied in many different settings. Convolution can be used for continuous functions (as in signals and systems) and in discrete cases (as in probability and statistics); the convolution variable can be time, space, or a multi-dimensional variable. The use of convolution in AI image recognition is precisely an application in a discrete, multi-dimensional setting.

Convolution layer and convolution calculation

Now let us think about how a computer, given an image containing an "X", could find that "X". One possible way is to have the computer store a standard "X" pattern, place this standard pattern over each part of the input image, and compare. If some part matches the standard pattern, the computer decides that an "X" has been found. Ideally, the standard pattern should also be able to zoom in, zoom out and rotate.

As mentioned above, the human eye can see a certain pattern in a figure "at a glance". There is in fact a mathematical model for this "glance": the δ-function. When a δ-function is used in a convolution, because it has a value only at an isolated point, it can "extract" the value of the f function at that point.

As shown in Figure 6, the standard pattern (the convolution kernel in the figure) acts like a pair of eyes: its 3x3 window slides over the 7x7 input data, just as the eyes scan the image to pick out the parts that match the standard. This comparison-and-extraction process is exactly the convolution operation. The specific calculation: multiply the 3x3 block of values currently under the window, element by element, with the 3x3 values of the convolution kernel, add the nine products together, and write the result into the single output cell corresponding to the center of the window. The matrix assembled from all these results is the output of the convolution; note that without padding it is smaller than the input, since a 3x3 window can occupy only 5x5 distinct positions inside a 7x7 input.



Figure 6: The convolution calculation when a neural network recognizes an X
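The window-sliding calculation described for Figure 6 can be sketched in a few lines of NumPy. The 7x7 image and the 3x3 "X-shaped" kernel below are hypothetical stand-ins for the matrices in the figure, not the actual numbers shown there; strictly speaking the code computes a cross-correlation (no kernel flip), which is what deep-learning libraries also call "convolution", since the kernel weights are learned anyway.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1          # "valid" mode: no padding, output shrinks
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i:i + kh, j:j + kw]    # patch currently under the window
            out[i, j] = np.sum(window * kernel)   # element-wise product, then sum
    return out

# Hypothetical 7x7 binary image containing an "X".
image = np.array([
    [1, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 1],
], dtype=float)

# Hypothetical 3x3 "X-shaped" kernel acting as the standard pattern.
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)   # (5, 5): a 3x3 window fits in only 5x5 positions of a 7x7 input
print(feature_map)         # the largest value sits where the image best matches the kernel
```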

In other words, the convolution kernel plays a role similar to a δ-function representing a pattern: it "samples" that pattern from the original image. In the language of the convolution formula, the input matrix on the left of Figure 6 is the f function, the convolution kernel is the h function, and the output on the right is the g function, the result of the convolution. The elements of the convolution kernel (a 3x3 matrix in the figure) are weight coefficients, and like the weights between fully connected layers they can be optimized through learning and training. In addition, an appropriate activation function is needed to introduce nonlinearity.
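In a real network both the kernel values and a bias term would be learned by backpropagation; the numbers below are hypothetical and only meant to show how a bias plus a ReLU nonlinearity turns a raw convolution output into a sparse "detection" map.

```python
import numpy as np

def relu(x):
    """ReLU activation: keep positive values, set negative values to zero."""
    return np.maximum(x, 0.0)

# Hypothetical raw output of a convolution layer (e.g. responses to an "X" kernel).
feature_map = np.array([
    [1.0, 2.0, 1.0],
    [2.0, 5.0, 2.0],
    [1.0, 2.0, 1.0],
])

bias = -3.0                        # hypothetical learned bias
print(relu(feature_map + bias))    # only the strong central match stays positive
```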

The function of convolution is "extraction". Extraction of what? In image recognition, it generally extracts the outlines of objects.

Pooling layer and convolutional neural network

Let us look back and consider more characteristics of human visual perception. When we can tell from a contour that something is a cat, we notice an interesting and useful fact: even if the image is shrunk considerably, we can still judge that it is a cat. This shows that the stored contour image contains a great deal of redundant information.

We do not need so much redundant information, because it wastes storage space; moreover, extra information sometimes does more harm than good and increases the error rate of the judgment. Therefore, the output of the convolution layer shown in Figure 6 is sent to a network layer called "pooling". The role of pooling is to downsample the feature map and reduce the redundancy of the information, which lowers the number of model parameters and the computing cost, reduces the risk of overfitting, and makes the network less sensitive to shifts of feature positions in the input image, such as deformation, distortion and translation.
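A minimal sketch of the pooling step, assuming the common choice of 2x2 max pooling with stride 2 (the input feature map below is hypothetical):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Downsample by keeping the maximum of each non-overlapping size x size block."""
    h = (x.shape[0] // size) * size            # drop edge rows/columns that don't fit
    w = (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# A hypothetical 4x4 feature map: pooling halves each dimension but keeps
# the strongest responses, so small shifts in the input barely change the result.
feature_map = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 4.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
])
print(max_pool2d(feature_map))    # -> [[3. 2.] [1. 4.]]
```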

The explanation above covers one convolution layer plus one pooling layer recognizing a single simple pattern. In practice, for large numbers of complex color images there are many patterns to recognize, so many complicating factors must be considered. The feature extractors described above are not designed "by hand" but are generated automatically through learning; this automation is the charm of training multi-layer networks with backpropagation. The basic idea, however, is the same as above: several convolution layers, plus nonlinearities, plus pooling layers are enough to recognize image content ranging from simple patterns (corners, edges, etc.) to complex objects (faces, chairs, cars, etc.), as shown in Figure 7.



Figure 7: Overall diagram of a convolutional neural network
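Putting the pieces together, a toy forward pass through two convolution-nonlinearity-pooling stages could look like the sketch below. Everything here is hypothetical: the 16x16 input, the untrained random kernels, and the bias of -1.0 are arbitrary choices, and the helpers from the earlier sketches are repeated in condensed form so the snippet runs on its own. The point is only the repeated structure of convolution, then ReLU, then pooling, as in Figure 7.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2D convolution: slide k over x, multiply element-wise and sum."""
    kh, kw = k.shape
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k)
                      for j in range(x.shape[1] - kw + 1)]
                     for i in range(x.shape[0] - kh + 1)])

def max_pool2d(x, s=2):
    """2x2 max pooling: keep the largest value of each non-overlapping block."""
    h, w = (x.shape[0] // s) * s, (x.shape[1] // s) * s
    return x[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

def forward(image, kernels):
    """Toy forward pass: convolution -> ReLU -> pooling, repeated for each stage."""
    x = image
    for k in kernels:
        x = conv2d_valid(x, k)            # extract local features
        x = np.maximum(x - 1.0, 0.0)      # hypothetical bias of -1.0, then ReLU
        x = max_pool2d(x)                 # downsample, keep the strongest responses
    return x.ravel()                      # flatten; a real net feeds this to dense layers

rng = np.random.default_rng(0)
toy_image = rng.standard_normal((16, 16))                  # hypothetical input image
kernels = [rng.standard_normal((3, 3)) for _ in range(2)]  # untrained random kernels
print(forward(toy_image, kernels))                         # a short feature vector
```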

The calculations in convolution and pooling look like mere multiplication and summation; their overall role is to extract the important information and reduce the dimensionality. To understand the role of these two layers better, we can compare them with the Fourier analysis of a sound signal. An ordinary sound signal (say, a piece of music) is a rather complicated curve in the time domain, requiring a large amount of data at every moment to represent. After a Fourier transform into the frequency domain, it can be represented by just a few spectral components, the fundamental frequency and its overtones. In the simplest case, a single-frequency sound wave is a sine function of time in the time domain, while in the frequency domain it is just a δ-function. In other words, the Fourier transform effectively extracts and stores the main components of the sound signal and reduces the dimensionality of the data that describe it. Convolution plays a similar role in neural networks: first, it abstracts the important components and discards redundant information; second, it reduces the dimensionality of the data matrices, saving computing time and storage space. However, when convolutional neural networks are applied to image recognition, what they extract is the spatial variation of the image, not a time spectrum.
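The Fourier analogy can be checked numerically. The sampling rate and the 440 Hz tone below are arbitrary choices for illustration: one second of the signal needs thousands of samples in the time domain, while its spectrum is concentrated in essentially a single frequency bin.

```python
import numpy as np

fs = 8000                                   # hypothetical sampling rate, in Hz
t = np.arange(0, 1.0, 1.0 / fs)             # one second of time samples
signal = np.sin(2 * np.pi * 440.0 * t)      # a single-frequency tone (440 Hz)

spectrum = np.abs(np.fft.rfft(signal))      # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

print(len(signal))                          # 8000 numbers describe it in the time domain
print(freqs[np.argmax(spectrum)])           # ...but nearly all energy sits at 440.0 Hz
```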

The most widely known application of convolutional neural networks is face recognition, which we often encounter in phone photo apps. As shown in Figure 8, a "face" can be seen as a hierarchical superposition of simple patterns: the first hidden layer learns the contours and textures (edge features) of the face, the second hidden layer learns "shapes" such as eyes and noses formed from edges, and the third hidden layer learns face "patterns" composed of these shapes. The features extracted at each level become more and more abstract, and the final output identifies the object (yes or no) from these features.



Figure 8: The classification capability of each layer is becoming more and more "abstract"

Although neural networks originated as simulations of the brain, their later development was guided to a greater extent by mathematical theory and statistical methods, just as aircraft originated from imitating bird flight, yet the structure of a modern airplane is nothing like a bird's body.



References:

[1] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex", The Journal of Physiology (1962).

https://www.aminer.cn/archive/receptive-fields-binocular-interaction-and-functional-architecture-in-the-cat-s-visual-cortex/55a5761e612c6b12ab1cc946

[2]Fukushima, K. (1980) Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics, 36, 193-202.

https://doi.org/10.1007/BF00344251

[3] Yann LeCun, The Road to Science: People, Machines and the Future (When Machines Think, What Will Humans Do?), CITIC Publishing Group, 2021-08-01.

[4]Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541–551, Winter 1989.
