The multilayer perceptron is a special case of Rosenblatt's perceptron in which a single back-propagation algorithm trains all layers. For historical reasons the name does not reflect the distinctive features of this type of perceptron: it does not refer to the mere presence of several layers, since Rosenblatt's perceptron also had several layers. The distinctive feature is the presence of more than one trainable layer (usually two or three). A large number of trainable layers is not strictly necessary, since in theory a single hidden layer suffices to recode the input representation so that the output representation becomes linearly separable. There is a conjecture, however, that by using more layers one can reduce the number of elements in them, that is, the total number of elements across the layers will be smaller than with a single hidden layer. This conjecture is used successfully in deep learning and has a theoretical justification.
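The claim that a single hidden layer suffices can be illustrated on the classic XOR function. The following minimal sketch uses hand-picked threshold weights (purely illustrative, not a trained network): two hidden units recode the two-dimensional input so that a single linear threshold on the hidden representation computes XOR.

```python
def step(x):
    return 1 if x > 0 else 0

def hidden(x1, x2):
    # Hand-picked hidden layer recoding the input: h1 = OR, h2 = AND.
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    return h1, h2

def xor_net(x1, x2):
    # In the recoded (h1, h2) space, XOR is a single linear threshold.
    h1, h2 = hidden(x1, x2)
    return step(h1 - h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        assert xor_net(a, b) == (a ^ b)
```

In the original (x1, x2) space no single line separates the XOR classes; in the recoded (h1, h2) space the threshold h1 - h2 > 0.5 does.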
History of the concept of the multilayer perceptron
All types of perceptrons proposed by Rosenblatt since 1958 are multilayer by the modern classification. However, interest in perceptrons declined in the 1970s, and in 1986 Rumelhart designed the multilayer perceptron anew. Owing to a terminological inaccuracy on Rumelhart's part, the view spread in the literature that Rosenblatt's original perceptron was primitive and single-layer, and that only Rumelhart justified the introduction of hidden layers.
In his studies Rosenblatt mainly used the elementary perceptron with three layers, in which the weights of the first layer (between the S- and A-elements) were chosen randomly and then fixed. The combination of randomness and a large number of neurons in the A-layer gave a high probability of landing in a hyperspace in which the problem is linearly separable and convergence of the learning process is guaranteed. However, this technique was not sufficiently studied or understood at the time, and some scientists wrongly perceived the perceptron as having only one "working" layer, paying insufficient attention to the fixed first layer.
In 1969 Minsky and Papert published a critical analysis of the elementary perceptron in their book, identifying a number of limitations, which caused a decline in researchers' interest in perceptrons. Returning to the subject many years later, Rumelhart set out to create a more efficient and practical perceptron, which at a minimum required an attempt to demonstrate the absence of the limitations described by Minsky.
In 1986 Rumelhart published a collection of articles (a collective monograph) in which he returned to the discussion of perceptrons and justified the need for a back-propagation-of-error algorithm, the essence of which is that not only the second but also the first layer must be trained. He reproduced a figure from the book by Minsky and Papert and captioned it "A single-layer perceptron analyzed by Minsky and Papert". The inaccuracy was that Minsky and Papert did not analyze a single-layer perceptron; the Rosenblatt perceptron shown in the figure was presented by Rumelhart as single-layer because the first layer with its fixed weights was disregarded, ignoring the fact that this first layer played a significant role in transforming the input signal into a multidimensional, linearly separable space.
As a result, Rumelhart drew an incorrect conclusion:
In their famous book, Perceptrons, Minsky and Papert (1969) document the limitations of the perceptron. The simplest example of a function that cannot be computed by the perceptron is “exclusive or” (XOR).
The reference to Minsky here is not accurate: nowhere in his book does Minsky address the XOR problem as such. Moreover, Rosenblatt's perceptron solves the XOR problem. But the appearance of the term "single-layer perceptron" became the basis of a number of misunderstandings that entered subsequent lectures and textbooks. At the same time, Minsky made the insufficiently substantiated statement that "the perceptron works flawlessly only on the condition that the set of initial data is linearly separable", which could be further misinterpreted.
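How Rosenblatt's elementary perceptron solves XOR can be sketched as follows (the number of A-elements and the uniform weight distribution are assumptions for the illustration; Rosenblatt used far larger A-layers): a fixed, randomly wired first layer projects the inputs into a higher-dimensional binary space, and only the second layer is trained with the error-correction rule. With enough A-units the recoded XOR problem is linearly separable with high probability.

```python
import random

def step(x):
    return 1 if x > 0 else 0

# Fixed, randomly chosen S-A weights: the untrained first layer.
rng = random.Random(0)
N_A = 40  # number of A-elements (an assumption for this sketch)
a_weights = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(N_A)]

def a_layer(x1, x2):
    # Random projection of the 2-D input into an N_A-dimensional binary space.
    return [step(w1 * x1 + w2 * x2 + b) for (w1, w2, b) in a_weights]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

r_weights = [0.0] * N_A
bias = 0.0

def predict(x1, x2):
    a = a_layer(x1, x2)
    return step(sum(w * ai for w, ai in zip(r_weights, a)) + bias)

# Error-correction training of the second (A-R) layer only.
for _ in range(1000):
    errors = 0
    for (x1, x2), t in data:
        err = t - predict(x1, x2)
        if err != 0:
            errors += 1
            a = a_layer(x1, x2)
            r_weights = [w + err * ai for w, ai in zip(r_weights, a)]
            bias += err
    if errors == 0:
        break
```

The first layer is never updated; only the A-to-R weights and the bias are changed, exactly as in the elementary perceptron with fixed random S-A connections.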
Yet as far back as 1964, E. M. Braverman, the author of the compactness hypothesis, demonstrated with an example how Rosenblatt's perceptron solves an arbitrary problem without requiring the initial data to satisfy the compactness hypothesis; in modern terms, only the sufficiency and representativeness of the training set were required:
When interpreting the work of a perceptron from the point of view of the potential-function algorithm, we did not impose any specific requirements on the planes realized by the A-elements. We were indifferent to the coefficients of the planes and to the direction of their "positive" sides. It was only required that, while moving along a straight line in the receptor space, we cross "from plus to minus" a sufficiently large number of planes. In other words, the receptor space had to contain a sufficiently large number of differently oriented planes.
Moreover, this statement of Minsky's has since been refuted by S. S. Yakovlev, who, analyzing Rosenblatt's works, indicated which architectural features a perceptron (or any other neural network) should have in order to form a feature space that satisfies the compactness hypothesis. Equivalently, he described the conditions under which a high probability (more than 99.99%) of landing in a hyperspace with linear separability, and hence guaranteed convergence of the learning process, is ensured.[5]
Further, Rumelhart writes:
Thus we see that XOR is not solvable in two dimensions, but if we add the appropriate third dimension, that is, the appropriate new feature, the problem becomes solvable. Moreover, if one uses a multilayer perceptron, it is possible to take the original two-dimensional problem and convert it into the corresponding three-dimensional problem, making it solvable. Indeed, Minsky and Papert knew that it is always possible to transform any unsolvable problem into a solvable one in a multilayer perceptron. In the more general case of multilayer networks, we categorize the elements into three classes: input elements, which receive input patterns directly; output elements, for which there is associative learning or a selected input; and hidden elements, which neither receive input directly nor have direct feedback. This is a stock of elements from which new properties and new internal representations can be obtained. The problem is to know which new properties are needed to solve the problem. In short, we must be able to train intermediate layers. The question is, how? The original perceptron learning procedure applies to no more than one layer. Minsky and Papert believed that no such general procedure could be found.
Here Rumelhart sees the solution in the need to "take the original two-dimensional problem and transform it into the corresponding three-dimensional problem", and justifies the fundamental importance of multilayeredness, mistakenly believing that Rosenblatt's perceptron lacked this quality. The difference is only that Rosenblatt avoided training the first layer, using its random projection onto a multidimensional space instead. Moreover, Rosenblatt considered a complete analogue of Rumelhart's multilayer perceptron, called a perceptron with variable S-A connections, for which, among other things, it was proved that a learning procedure similar to back propagation of error cannot always guarantee that a solution is reached (i.e. cannot guarantee convergence).
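Rumelhart's "appropriate third dimension" can be made concrete. One hypothetical choice of new feature is x3 = x1 * x2; in the resulting three-dimensional space a single linear threshold separates the XOR classes:

```python
def step(x):
    return 1 if x > 0 else 0

def xor_3d(x1, x2):
    x3 = x1 * x2  # the added third feature (an illustrative choice)
    # A single linear threshold separates XOR in the (x1, x2, x3) space.
    return step(x1 + x2 - 2 * x3 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        assert xor_3d(a, b) == (a ^ b)
```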
It should be noted that Rumelhart was not alone in developing the method of back propagation of error; theoretical works by other researchers were already devoted to the same question:
- Werbos P. J.
- Galushkin A. I.
- Bartsev S. I. and Okhonin V. A. (the Krasnoyarsk group), who published a more general version of the method of back propagation of error, the duality principle, applicable also to infinite-dimensional systems, systems with delay, etc.,
but it was Rumelhart's work that revived practical interest in perceptrons.
In this connection, Wasserman's words are noteworthy:
Although such duplication is common in every scientific field, in artificial neural networks the situation is much more serious because of the borderline nature of the subject of research itself. Studies on neural networks are published in so many different books and journals that even the most qualified researcher needs considerable effort to be aware of all the important work in this area.
Because of these inaccuracies, the erroneous view spread in the literature that Rosenblatt's perceptron was primitive and that only Rumelhart, by creating his multilayer perceptron, opened up the possibility in principle of performing linear separation and solving the XOR problem, whereas Rumelhart's discovery consisted mainly in a practical modification of the multilayer scheme and the development of a fundamentally new learning method.
Differences between the multilayer perceptron and the Rosenblatt perceptron
In 1988 Minsky republished the book Perceptrons with new chapters. These, in particular, analyze the differences between training a perceptron by the error-correction method and training Rumelhart's multilayer perceptron by the back-propagation-of-error method. Minsky showed that there are no qualitative differences: both methods solve comparable problems with the same efficiency and the same limitations; the difference lies only in the way the solution is reached.
Among the differences between the Rumelhart multilayer perceptron and the Rosenblatt perceptron are the following:
- The use of a nonlinear activation function, usually sigmoidal.
- The number of trainable layers is more than one. Most often, applications use no more than three.
- The signals arriving at the inputs and taken from the outputs are not binary but may be encoded as decimal numbers, which need to be normalized so that the values lie in the range from 0 to 1 (normalization is necessary at least for the output data, in accordance with the sigmoid activation function).
- Arbitrary linking architecture is allowed (including fully connected networks).
- The network error is calculated not as the number of misclassified patterns after a training iteration, but as a certain statistical measure of the discrepancy between the desired and the obtained values.
- Training is carried out not until no errors remain, but until the weight coefficients stabilize during training, or is interrupted earlier in order to avoid overfitting.
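The listed differences can be sketched together in one minimal example (the hyperparameters, i.e. learning rate, hidden-layer size, epoch limit, and stabilization tolerance, are assumptions for the illustration): a sigmoid hidden layer trained by back propagation of a squared-error measure, with training stopped when the weight updates stabilize rather than when the error count reaches zero. Since back propagation can get stuck in local minima, several random initializations are tried.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# XOR as a toy task; real-valued inputs already lie in [0, 1].
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def train(seed, n_hid=3, lr=0.5, max_epochs=5000, tol=1e-3):
    rng = random.Random(seed)
    # Hidden weights (2 inputs + bias) and output weights (n_hid + bias).
    w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hid)]
    w_o = [rng.uniform(-1, 1) for _ in range(n_hid + 1)]
    for _ in range(max_epochs):
        max_dw = 0.0
        for (x1, x2), t in DATA:
            x = [x1, x2, 1.0]
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
            hb = h + [1.0]
            y = sigmoid(sum(w * hi for w, hi in zip(w_o, hb)))
            # Back-propagated gradients of the squared error 0.5 * (y - t)**2.
            d_o = (y - t) * y * (1.0 - y)
            d_h = [d_o * w_o[j] * h[j] * (1.0 - h[j]) for j in range(n_hid)]
            for j in range(n_hid + 1):
                dw = lr * d_o * hb[j]
                w_o[j] -= dw
                max_dw = max(max_dw, abs(dw))
            for j in range(n_hid):
                for i in range(3):
                    dw = lr * d_h[j] * x[i]
                    w_h[j][i] -= dw
                    max_dw = max(max_dw, abs(dw))
        if max_dw < tol:  # stop when the weights have stabilized
            break
    def predict(x1, x2):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, [x1, x2, 1.0])))
             for row in w_h] + [1.0]
        return sigmoid(sum(w * hi for w, hi in zip(w_o, h)))
    return predict

# Back propagation can get stuck, so try several random initializations.
predict = next(p for p in (train(s) for s in range(10))
               if all((p(x1, x2) > 0.5) == (t > 0.5) for (x1, x2), t in DATA))
```

Note that, unlike the error-correction rule, the stopping criterion here never looks at the number of misclassified patterns: training ends when the largest weight update in an epoch falls below the tolerance.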
The multilayer perceptron will have functional advantages over Rosenblatt's perceptron only if it does not merely produce some reaction to stimuli (since a reaction of any type can already be obtained in the perceptron), but increases the efficiency with which such reactions are developed. For example, the ability to generalize, that is, to react correctly to stimuli the perceptron was not trained on, would improve. But at the moment there are no such generalizing theorems; there is only a mass of studies on various standardized tests comparing different architectures.
- Yoshua Bengio, Aaron Courville, Pascal Vincent. Representation Learning: A Review and New Perspectives, 2014.
- For example, in Wasserman, P. Neurocomputer Engineering: Theory and Practice = Neural Computing: Theory and Practice. - M.: Mir, 1992. - 240 p. - ISBN 5-03-002115-9. Archived copy of June 30, 2009 on the Wayback Machine.
- Parallel Distributed Processing: Explorations in the Microstructure of Cognition / Ed. by Rumelhart D. E. and McClelland J. L. - Cambridge, MA: MIT Press, 1986.
- Arkadyev A. G., Braverman E. M. Learning Machines for the Classification of Objects. - M.: Nauka, Main Editorial Board for Physical and Mathematical Literature, 1971. - 192 p.
- Jakovlev S. Perceptron Architecture Providing a Compact Description of Images. - Riga: Riga Technical University (RTU), 2009.
- Werbos P. J. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
- Galushkin A. I. Synthesis of Multilayer Pattern Recognition Systems. - M.: Energia, 1974.
- Bartsev S. I., Okhonin V. A. Adaptive Networks of Information Processing. - Krasnoyarsk: Institute of Physics, Siberian Branch of the USSR Academy of Sciences, 1986. - Preprint No. 59B. - 20 p.
- Gorban A. N. Training of Neural Networks. - M.: USSR-USA JV Paragraph, 1990. - 160 p.
- Wasserman P. Neurocomputer Engineering: Theory and Practice = Neural Computing: Theory and Practice. - M.: Mir, 1992. - 240 p.
- Error back propagation method
- Simon Haykin. Neural Networks: A Comprehensive Foundation. - 2nd ed. - M.: Williams, 2006. - 1104 p. - ISBN 0-13-273350-1.
- Kevin Swingler. Applying Neural Networks: A Practical Guide (translation by Yu. P. Masloboev) (inaccessible link)
- Mirkes E. M. Neuroinformatics: A Manual with Programs for Laboratory Work. 2003. ISBN 5-7636-0477-6. Contains lectures and software, including software for modeling multilayer perceptrons.
- The website of Viktor Tsaregorodtsev, containing scientific articles on the use of the multilayer perceptron.
- Flood: an open-source neural networks C++ library.
- Weka: open-source data mining software with a multilayer perceptron implementation.