Growing neural gas (GNG) is an algorithm that allows adaptive clustering of input data: it not only divides the space into clusters, but also determines the required number of clusters from the characteristics of the data itself. It belongs to a class of computing mechanisms in which the number and location of artificial neurons in the feature space are not set in advance, but are calculated during training in accordance with the features of the input data, adjusting to them independently [1].
Creation History
There are techniques that can distinguish the most similar objects in a space and form groups of them. In the process of analysis, a set of objects is organized into subsets based on measured similarities. Most methods follow a standard scheme: optimize the relationship between the spatial arrangement of vectors and the set of objects so that each vector determines the structure of the clusters. However, most techniques have two significant drawbacks: the analysis depends on a pre-specified number of clusters, and the partition into clusters is fixed at the moment it is computed. Clustering methods of the time were static and could not adapt their results: if new data arrived, the algorithm had to be re-executed from scratch. In the 1990s, researchers in artificial neural networks [2] concluded that a new class of computational mechanisms was needed: one in which the number and location of artificial neurons in the feature space are not set in advance, but are calculated while training such models in accordance with the features of the input data, adjusting to them independently.
Definition
“Growing neural gas is an algorithm that allows adaptive clustering of input data, that is, not only dividing the space into clusters, but also determining the required number of them based on the characteristics of the data itself. Growing neural gas does not require a priori information about the data, such as an estimate of the number of clusters or their shape.” [3] In this model, the neighborhood of nodes is not fixed but changes dynamically as clustering improves. Not only the neighborhood relations are variable, but also the number of cluster neurons.
Algorithm Description
Starting with just two neurons, the algorithm sequentially changes (for the most part, increases) their number, while creating a set of connections between neurons that best fits the distribution of input vectors. Each neuron has an internal variable that accumulates a “local error”. Connections between nodes are characterized by a variable called age. [4] The steps are as follows (a runnable sketch in Python follows the list):
- First, two nodes are created (hereinafter, node = neuron) with weight vectors drawn from the distribution of input vectors and with zero local errors;
- The two nodes are connected by a link, whose age is set to 0 at the initial stage.
- Then a vector $x$ is fed to the input of the neural network.
- Next, the two neurons $s$ and $t$ closest to $x$ are found (with $s$ closer than $t$), i.e. the nodes with weight vectors $w_s$ and $w_t$ such that $\|w_s - x\|^2$ is the minimum and $\|w_t - x\|^2$ is the second-smallest distance value among all nodes.
- The local error of the winner (the closest neuron $s$) is updated by adding to it the squared distance between the vectors $w_s$ and $x$: $E_s \leftarrow E_s + \|w_s - x\|^2$.
- This procedure means that the nodes that win most often, i.e. those in whose vicinity the largest number of input signals fall, accumulate the largest error values; these areas thereby become the main candidates for “densification” by adding new nodes.
- The winner $s$ and all of its topological neighbors (i.e. all neurons having a connection with the winner) are shifted toward the input vector by distances equal to the fractions $\varepsilon_w$ and $\varepsilon_n$ of the full distance: $w_s \leftarrow w_s + \varepsilon_w (x - w_s)$ and, for every neighbor $n$, $w_n \leftarrow w_n + \varepsilon_n (x - w_n)$.

The shift of the nodes toward the input vector at this step means that the winner seeks to “average” its position among the input signals located in its vicinity; the best neuron is pulled slightly toward the signal, and its neighbors with it.
- The age of every connection emanating from the winner $s$ is increased by 1.
- If the two best neurons $s$ and $t$ are already connected, the age of their connection is reset to zero. Otherwise, a connection is created between them.
- All connections whose age exceeds the maximum $a_{max}$ are removed. If this leaves neurons with no connections to other nodes, those neurons are removed as well.
- If the number of the current iteration is a multiple of $\lambda$ and the size limit of the network has not been reached, a new neuron is created according to the rules below. Over time, after several cycles of displacements, information accumulates on the basis of which the decision is made about where the new neuron should be added.
- Find the neuron $u$ with the greatest local error.
- Among the neighbors of $u$, find the neuron $v$ with the maximum error.
- Create a node $r$ “in the middle” between $u$ and $v$: $w_r = (w_u + w_v)/2$.
- Replace the connection between $u$ and $v$ with connections between $u$ and $r$ and between $v$ and $r$.
- Reduce the errors of the neurons $u$ and $v$ by a factor $\alpha$ and set the error of the new neuron $r$ equal to the updated error of $u$: $E_u \leftarrow \alpha E_u$, $E_v \leftarrow \alpha E_v$, $E_r \leftarrow E_u$.
- Finally, the errors of all neurons in the layer are decreased by a fraction $\beta$: $E_j \leftarrow E_j - \beta E_j$. This correction of the error variables is needed so that the network “forgets” old input vectors and responds better to new ones. Thanks to it, the growing neural gas can adapt the network to a non-stationary, slowly drifting distribution of input signals.
Each time, for a randomly selected $x$, the neuron closest to it is determined, and the local error of that neuron is incremented by the squared distance $\|w_s - x\|^2$. A large value of this accumulated error indicates that the corresponding neuron lies in a region where many input signals fall but few neurons are present.
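To make the procedure concrete, here is a minimal sketch in Python with NumPy. It is an illustration rather than a reference implementation: the parameter values are arbitrary, the names (eps_w, eps_n, a_max, lam, alpha, beta) mirror the symbols in the steps above, and the pruning of isolated nodes in step 8 is omitted to keep the array indexing simple.

```python
import numpy as np

def growing_neural_gas(data, max_nodes=30, n_iter=5000, eps_w=0.05,
                       eps_n=0.006, a_max=50, lam=100, alpha=0.5,
                       beta=0.0005, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: two nodes with weights drawn from the input distribution,
    # zero local errors, joined by a connection of age 0.
    w = [data[rng.integers(len(data))].astype(float) for _ in range(2)]
    error = [0.0, 0.0]
    age = {(0, 1): 0}                                  # connection -> age

    def neighbors(i):
        return [a if b == i else b for (a, b) in age if i in (a, b)]

    for it in range(1, n_iter + 1):
        x = data[rng.integers(len(data))]              # step 2: input vector x
        d = [float(np.sum((x - wi) ** 2)) for wi in w]
        s, t = (int(i) for i in np.argsort(d)[:2])     # step 3: winner s, runner-up t
        error[s] += d[s]                               # step 4: E_s += ||w_s - x||^2
        w[s] += eps_w * (x - w[s])                     # step 5: move winner ...
        for n in neighbors(s):
            w[n] += eps_n * (x - w[n])                 # ... and its neighbors
        for e in list(age):                            # step 6: age s's connections
            if s in e:
                age[e] += 1
        age[(min(s, t), max(s, t))] = 0                # step 7: refresh or create s-t
        for e in [e for e, a in age.items() if a > a_max]:
            del age[e]                                 # step 8: drop stale connections
        # (pruning of isolated nodes is omitted here to keep indexing simple)
        if it % lam == 0 and len(w) < max_nodes:       # step 9: insert node r
            u = int(np.argmax(error))
            nb = neighbors(u)
            if nb:
                v = max(nb, key=lambda j: error[j])    # worst neighbor of u
                w.append((w[u] + w[v]) / 2.0)          # w_r = (w_u + w_v) / 2
                r = len(w) - 1
                age.pop((min(u, v), max(u, v)), None)  # u-v replaced by u-r, v-r
                age[(min(u, r), max(u, r))] = 0
                age[(min(v, r), max(v, r))] = 0
                error[u] *= alpha                      # discount E_u and E_v
                error[v] *= alpha
                error.append(error[u])                 # E_r = E_u
        error = [e - beta * e for e in error]          # step 10: decay all errors

    return np.array(w), list(age)

# Example usage: fit the model to points scattered around a unit circle.
pts = np.random.default_rng(1).normal(size=(1000, 2))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
nodes, edges = growing_neural_gas(pts, max_nodes=20, n_iter=3000)
print(len(nodes), "nodes,", len(edges), "connections")
```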
Data Structure Form
The researcher can specify the shape of the cluster structure directly, i.e. whether clustering is performed for a hypersphere, a hypertube, or a hyperplane. If this knowledge is not available, the appropriate form can be determined from the eigenvalues of the cluster's covariance matrix. If the structure has at least one eigenvalue below a user-chosen threshold, the model is taken to be hyperlinear (a hyperplane); otherwise the structure must be treated as a nonlinear manifold. A further check shows whether the model has the shape of a sphere or a tube. The test for sphericity depends on the inequality $n_p / n_a > \psi$, where $n_p$ is the number of vectors inside the cluster, found using the Jordan–Brouwer theorem [5], $n_a$ is the surface area of the cluster, and $\psi$ is a user-defined threshold. If instead $n_p / n_a < \psi$, the cluster takes the form of a “hypertube”. [4]
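As a hedged illustration of this decision procedure (not the exact computation from [4]), the sketch below checks the covariance eigenvalues against a threshold and then the $n_p / n_a$ ratio; the convex-hull surface area serves as a stand-in for the cluster surface area, and both thresholds are arbitrary placeholders.

```python
import numpy as np
from scipy.spatial import ConvexHull

def cluster_shape(points, eig_threshold=1e-2, psi=5.0):
    """Classify a cluster's shape from its member vectors (illustrative)."""
    # At least one covariance eigenvalue below the threshold -> hyperlinear.
    eigvals = np.linalg.eigvalsh(np.cov(points.T))
    if np.any(eigvals < eig_threshold):
        return "hyperplane"
    # Otherwise test sphericity via the ratio n_p / n_a.
    n_p = len(points)                  # vectors inside the cluster
    n_a = ConvexHull(points).area      # stand-in for the cluster surface area
    return "hypersphere" if n_p / n_a > psi else "hypertube"

# Example: a tight 3-D blob should pass the sphericity test.
blob = np.random.default_rng(0).normal(scale=0.5, size=(400, 3))
print(cluster_shape(blob))
```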
The distance from a vector $x$ to neurons in clusters of various shapes is computed as follows.
For a cluster in the form of a hypertube, a radial (Mahalanobis-type) measure of distance is calculated:

$d_j(x) = \sqrt{(x - w_j)^T A_j (x - w_j)},$

where $A_j$ is a positive definite matrix calculated so as to take into account the eccentricity and orientation of the hypertube [6]. The value of $A_j$ in this equation is found from the Löwner ellipsoid using Khachiyan's algorithm [7].
To determine distances to a hyperplane, a weighted quadratic form of the following kind is used:

$d_j(x)^2 = (x - w_j)^T A_j (x - w_j) - \sum_k \left((x - w_j)^T b_{j,k}\right)^2,$

where $A_j$ is a positive definite symmetric weight matrix, and the vectors $b_{j,k}$ are estimated by finding the eigenvectors of the model's neural nodes.
To determine the distance in a hypersphere, the Euclidean distance to the center is used:

$d_j(x) = \|x - w_j\|,$

where $w_j$ is the average of the vectors belonging to the cluster.
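Under the reconstructions given above, the three measures can be written compactly as follows; this is a sketch assuming NumPy, with $A_j$, $b_{j,k}$, and $w_j$ being the per-cluster quantities from the text.

```python
import numpy as np

def dist_hypertube(x, w_j, A_j):
    # Radial (Mahalanobis-type) measure with positive definite A_j,
    # which encodes the eccentricity and orientation of the tube.
    diff = x - w_j
    return float(np.sqrt(diff @ A_j @ diff))

def dist_hyperplane(x, w_j, A_j, b_jk):
    # Weighted distance to the plane through w_j spanned by the
    # orthonormal eigenvectors b_jk (rows); with A_j = I this is the
    # ordinary perpendicular distance to the plane.
    diff = x - w_j
    d2 = diff @ A_j @ diff - np.sum((b_jk @ diff) ** 2)
    return float(np.sqrt(max(d2, 0.0)))

def dist_hypersphere(x, w_j):
    # Euclidean distance to the centroid w_j (the mean of the
    # vectors belonging to the cluster).
    return float(np.linalg.norm(x - w_j))

# Example: perpendicular distance from (1, 1, 1) to the x-y plane is 1.
x = np.array([1.0, 1.0, 1.0])
B = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # plane basis vectors
print(dist_hyperplane(x, np.zeros(3), np.eye(3), B))   # -> 1.0
```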
Data Visualization
In three-dimensional space, the data are easy to visualize directly. [4]
However, if the space has more than three dimensions, visualization becomes difficult. To solve this problem, a technique based on VAT (Visual Assessment of cluster Tendency) is used [8]. The essence of the construction is that a minimum spanning tree of the model is built. After the sorting is completed, the cluster structure can be read off from the squares near the diagonal. First, the normalized pairwise dissimilarities between neurons are computed within each isolated graph. The neurons are then reordered so as to produce the densest possible intracluster arrangement. Each cluster is then given its own color and placed along the main diagonal. Intercluster relations are also included in the diagram: the maximum distance between two clusters is shown in white, and the smallest in black. Cluster volume can be added as another dimension, namely the height of the squares.
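The sketch below gives a rough approximation of this construction, assuming matplotlib and SciPy are available. Classic VAT orders points by a Prim-style traversal of the minimum spanning tree; here a depth-first order over the tree is used as a simple substitute.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse.csgraph import minimum_spanning_tree, depth_first_order
from scipy.spatial.distance import pdist, squareform

def vat_image(nodes):
    D = squareform(pdist(nodes))                 # pairwise neuron dissimilarities
    mst = minimum_spanning_tree(D)               # minimum spanning tree of the model
    order, _ = depth_first_order(mst + mst.T, i_start=0, directed=False)
    R = D[np.ix_(order, order)]                  # reorder for dense diagonal blocks
    plt.imshow(R / R.max(), cmap="gray")         # white = far apart, black = close
    plt.title("VAT-style view of the neuron dissimilarity matrix")
    plt.show()
```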
An example of using growing neural gas
This example demonstrates how the system adapts as new data arrives. The database consists of 1050 point objects. At the beginning, 5000 iterations were performed, covering about 75% of the data. After this first portion of 756 data points was introduced into the system, the neural vectors adapted to form the distribution shown in the figure below.
Then another 150 new vectors were introduced. This led to the formation of a new spherical class, indicated in the figure below:
Despite the spatial proximity of the green and purple clusters, the algorithm registered the growth in the number of clusters and adapted to these changes. The remaining 120 objects were then repeatedly mixed between the green and purple clusters; the algorithm distributed these data between the two clusters and retained the original number of clusters.
Notes
- [1] Growing Neural Gas - Implementation in the MQL5 Programming Language.
- [2] Artificial neural networks.
- [3] Neural.ru dictionary.
- [4] Isaac J. Sledge, Growing Neural Gas for Temporal Clustering, IEEE, 2008.
- [5] M. Berg, M. Kreveld, M. Overmars, O. Schwarzkopf, Computational Geometry, Springer-Verlag, New York, 2000.
- [6] G. Carpenter, “Competitive Learning: From Interactive Activation to Adaptive Resonance,” Cognitive Science, vol. 11, 1987.
- [7] L. Khachiyan, M. Todd, “On the Complexity of Approximating the Maximal Inscribed Ellipsoid for a Polytope,” Math. Prog., 1993.
- [8] J. Keller, I. Sledge, “A Cluster By Any Other Name,” IEEE Proc., NAFIPS, 2007.
See also
- T. Martinetz, Neural Gas Network for Vector Quantization and its Application to Time-Series Prediction, IEEE, vol. 4, 1993.
- T. Martinetz, Neural Gas Network Learns Topologies.