The Self-Organizing Map – an Intro
October 20, 2011 § Leave a comment
A Map that organizes itself:
Is it the paradise for navigators, or is it the hell?
Well, it depends, I would say. As a control freak, or a warfarer like Shannon in the early 1940ies, you probably vote for the hell. And indeed, there are presumably only very few entities that have been so strongly neglected by information scientists like it was the case for the self-organizing map. Of course, there are some reasons for that. The other type of navigator probably enjoying the SOM is more likely of the type Odysseus, or Baudolino, the hero in Umberto Eco’s novel of the same name.
More seriously, the self-organizing map (SOM) is a powerful and even today (2011) a still underestimated structure, though meanwhile rapidly gaining in popularity. This chapter serves as a basic introduction into the SOM, along with a first discussion of the strength and weaknesses of its original version. Today, there are many versions around, mainly in research; the most important ones I will briefly mention at the end. It should be clear that there are tons of articles around in the web. Yet, most of them focus on the mathematics of the more original versions, but do not describe or discuss the architecture itself, or even provide a suitable high-level interpretation of what is going on in a SOM. So, I will not repeat the mathematics, instead I will try to explain it also for non-engineers without using mathematical formulas. Actually, the mathematics is not the most serious thing in it anyway.
The SOM is a bundle comprising a mathematical structure and a particularly designed procedure for feeding multi-dimensional (multi-attributes) data into it that are prepared as a table. Numbers of attributes can reach tens of thousands. Its purpose is to infer the best possible sorting of the data in a 2(3) dimensional grid. Yet, both preconditions, dimensionality and data as table, are not absolute and may be overcome by certain extensions to the original version of the SOM. The sorting process groups more similar records closer together. Thus we can say that a body of high-dimensional data (organized as records from a table) are mapped onto 2 dimensions, thereby enforcing a weighting of the properties used to describe (essentially: create) the records.
The SOM can be parametrized such that it is a very robust method for clustering data. The SOM exhibits an interesting duality, as it can be used for basic clustering as well as for target oriented predictive modeling. This duality opens interesting possibilities for realizing a pre-specific associative storage. The SOM is particularly interesting due to its structure and hence due to its extensibility, properties that other most methods do not share with the SOM. Though substantially different to other popular structures like Artificial Neural Networks, the SOM may be included into the family of connectionist models.
The development leading finally to the SOM started around 1973 in a computer science lab at the Helsinki university. It was Teuvo Kohonen who got aware to certain memory effects of correlation matrices. Until 1979, when he first published the principle of the Self-Organizing Map, he dedicatedly adopted principles known from the human brain. A few further papers followed and a book about the subject in 1983. Then, the SOM wasn’t readily adapted for at least 15 years. Its direct competitor for acceptance, the Backpropagation Artificial Neural network (B/ANN), was published in 1985, after the neural networks have been rediscovered in physics, following investigations of spin glasses and certain memory effects there. Actually, the interest in simulating neural networks dates back to 1941, when von Neumann, Hebb, McCulloch, and also Pitts, among others, met at a conference on the topic.
For a long time the SOM wasn’t regarded as a “neural network,” and this has been considered being a serious disadvantage. The first part of the diagnosis indeed was true: Kohonen never tried to simulate individual neurons, as it was the goal for all simulations of ANN. The ANN research has been deeply informed by physics, cybernetics and mathematical information theory. Simulating neurons is simply not adequate, it is kind of silly science. Above all, most ANN are just a very particular type of a “network” as there are no connections within a particular layer. In contrast, Kohonen tried to grasp a more abstract level: the population of neurons. In our opinion this choice is much more feasible and much more powerful as well. In particular, SOM can not only represent “neurons,” but any population of entities which can exchange and store information. More about that in a moment.
Nowadays, the methodology of SOM can be rated as well adopted. More than 8’000 research papers have been published so far, with increasing momentum, covering a lot of practical domains and research areas. Many have demonstrated the superiority or greater generality of SOM as compared to other methods.
The mechanism of a basic SOM is quite easy o describe, since there are only a few ingredients.
First, we need data. Imagine a table, where observations are listed in rows, and the column headers describe the variables that have been measured for each observed case. The variables are also called attributes, or features. Note, that in the case of the basic (say, the Kohonen-) SOM the structure given by the attributes is the same for all records. Technically, the data have to be normalized per column such that the minimum value is 0 and the maximum value is 1. Note, that this normalization ensures comparability of different sub-sets of observations. It represents just the most basic transformation of data, while there are many other transformation possible: logarithmic re-scaling of values of a column in order to shift the mode of the empirical distribution, splitting a variable by value, binarization, scaling of parameters that are available only on nominal niveau, or combining two or several columns by a formula are further examples (for details please visit the chapter about modeling). In fact, the transformation of data (I am not talking here about the preparation of data!) is one of the most important ingredients for successful predictive modeling.
Second, we create the SOM. Basically, and its simplest form, the SOM is a grid, where each cell has 4 (squares, rectangles) or 6 edges (hexagonal layout). The grid consists from nodes and edges. Nodes serve as a kind of container, while edges work as a kind of fibers for spreading signals. In some versions of the SOM the nodes can range freely or they can randomly move around a little bit.
An important element of the architecture of a SOM now is that each node gets the same structure assigned as we know from the table. As a consequence, the vectors collected in the nodes can easily be compared by some function (just wait a second for that). In the beginning, each node get randomly initialized. Then the data are fed into the SOM.
This data feeding is organized as follows. A randomly chosen record is taken from the table and then compared to all of the nodes. There is always a best matching node. The record then gets inserted into this node. Upon this insertion, which is kind of hiding, the values in the nodes structure vector are recalculated, e.g. as the (new) mean for all values across all records collected in a node (container). The trick now is not to change just the winning node where the data record has been inserted, but all nodes of the the close surround also, though with a strength that decreases with the distance.
This small activity of searching the best matching node, insertion and information spreading is done for all records, and possibly also repeated. The spreading of information to the neighbor nodes is a crucial element in the SOM mechanism. This spreading is responsible for the self-organization. It also represents a probabilistic coupling in a network. Of course, there are some important variants to that, see below, but basically that’s all. Ok, there is some numerical bookkeeping, optimizations to search the winning nodes etc. but these measures are not essential for the mechanism.
As a result one will find similar records in the same node, or a the direct neighbors. It has been shown that the SOM is topology preserving, that is, the SOM is smooth with regard to the similarity of neighbor nodes. The data records inside the nodes are a list, which is described by the node’s value vector. That value vector could be said to represent a class, or intension, which is defined by its empirical observations, the cases, or extension.
After feeding all data to the SOM the training has been finished. For SOM, it is easy to run in a continuous mode, where the feed of incoming data is not “stopping” at any time. Now the SOM can be used to classify new records. A new record simply needs to be compared to the nodes of the SOM, i.e. to the value vector of the nodes, but NOT to all the cases (SOM is not case-based reasoning, but type-based reasoning!). If the records contained a marker attribute, e.g. indicating the quality of the record, you will also get the expected quality for a new record of unknown quality.
Properties of the SOM
The SOM belongs to the class of clustering algorithms. It is very robust against missing values in the table, and unlike many other methods it does NOT require any settings regarding the clusters, such as size or number. Of course, this is a great advantage and a property of logical consistency. Nodes may remain empty, while the node value vector of the empty node is well-defined. This is a very important feature of the SOM, as this represents the capability to infer potential yet unseen observations. No other method is able to behave like this. Other properties can be invoked by means of possible extensions of the basic mechanism (see below)
As already said, nodes collect similar records of data, where a record represents a single observation. It is important to understand, that a node does not equal to a cluster. In our opinion, it does not make much sense to draw boundaries around one or several nodes and so proposing a particular “cluster.” This boundary should be set only upon an external purpose. Inversely, without such a purpose, it is sense-free to conceive of a trained SOM as a model. AT best, it would represent a pre-specific model, which however is a great property of the SOM to be able to create such.
The learning is competitive, since different nodes compete for a single record. Yet, it is also cooperative, since the upon an insert operation information is exchanged between neighbor nodes.
The reasoning of the SOM is type-based, which is much more favorable than case-based reasoning. It is also more flexible than ANN, which just provide a classification, but no distinction between extension and intension is provided. SOM, but not ANN, can be used in two very different modes. Either just for clustering or grouping individual observations without any further assumptions, and secondly for targeted modeling, that is for establishing a predictive/ diagnostic link between several (or many) basic input variables and one (or several) target variable(s) that represent the outcome of a process according to experience. Such a double usage of the same structure is barely accessible for any other associative structure.
Another difference is that ANN are much more suitable to approximate single analytic functions, while SOM are suitable for general classification tasks, where the parameter space and/or the value (outcome) space could even be discontinuous or folded.
A large advantage over many other methods is that the similarity function and the cost function is explicitly accessible. For ANN, SVM or statistical learning this is not possible. Similarly, the SOM automatically adapts its structure according to the data, i.e. it is also possible to change the method within the learning process, adaptively and self-organized.
As a result we can conclude that the design of the SOM method is much more transparent than that of any other of the advanced methods.
Competing Comparable Methods
SOM are either more robust, more general or more simple than any other method, while the quality of classification is closely comparable. Among those competing methods are artificial neural networks (ANN), principal component analysis (PCA), multi-dimensional scaling (MDS), or adaptive resonance theory network (ART). Important ideas of ART networks can be merged with the SOM principle, keeping the benefits of both. PCA and MDS are based on statistical correlation analysis (covariance matrices), i.e. they are importing all the assumptions and necessary precondition of statistics, namely the independence of observational variables. Yet, it is the goal to identify such dependencies, thus it is not quite feasible to presuppose that! SOM do not know such limitations from strange assumptions; else, recently it has been proven that SOM are just a generalization of PCA.
Of course, there are many other methods, like Support Vector Machines (SVM) with statistical kernels, or tree forests; yet, these methods are purely statistical in nature, with no structural part in it. Else, they do not provide access to the similarity function as it is possible for the SOM.
A last word about the particular difference between ANN and SOM. SOM are true symmetrical networks, where each unit has its own explicit memory about observations, while the linkage to other units on the same level of integration is probabilistic. That means, that the actual linkage between any two units can be changed dynamically within the learning process. In fact, a SOM is thus not a single network like a fisher net, it is much more appropriate to conceive them as a representation of a manifold of networks.
Contrary to those advanced structural properties, the so-called Artificial Neural Networks are explicit directional networks.Units represent individual neurons and do not have storage capacities. Each unit does not know anything about things like observations. Conceptually, these units are thus on a much lower level than the units in a SOM. In ANN they can not have “inner” structure. The same is true for for the links between the units. Since they have to be programmed in an explicit manner (which is called “architecture”), the topology of the connections can not be changed during learning at runtime of the program.
In ANN information flows through the units in a directed manner (as in case of natural neurons). It is there almost impossible to create an equally dense connectivity within a single layer of units as in SOM. As a consequence, ANN do not show the capability for self-organization.
Taken as whole, ANN seem to be under the representationalist delusion. In order to achieve the same general effects and abstract phenomena as the SOM are able to, very large ANN would be necessary. Hence, pure ANN are not really a valid alternative for our endeavor. This does not rule out the possibility to use them as components within a SOM or between SOMs.
Variants and Architectures
Here are some SOM extensions and improvements of the SOM.
Homogenized Extensional Diversity
The original version of the SOM tends to collect “bad” records, those not matching well anywhere else, into a single cluster, even if the records are not similar at all. In this case it is not allowed to compare nodes any more, since the internal variance is not comparable anymore and the mean/variance on the level of the node would not describe the internal variance on the level of the collected records any more. The cure for that misbehavior is rather simple. The cost function controlling the matching of a record to the most similar node needs to contain the variability within the set of records (extension of the type represented by the node) collected by the node. Else, merging and splitting of nodes as described for structural learning helps effectively. In scientific literature, there is yet no reference for this extension of the SOM.
One of the most basic extensions to the original mechanism is to allow for splitting and merging of nodes according to some internal divergence criterion. A SOM made from such nodes is able to adopt structurally to the input data, not just statistically. This feature is inspired by the so-called ART-networks . Similarly, merging and splitting of “nodes” of a SOM was proposed by , though not in the cue of ART networks.
Since the SOM represents populations of neurons, it is easy and straightforward to think about nesting of SOM. Each node would contain a smaller SOM. A node may even contain any other parametrized method, such like Artificial Neural Networks. The node value vector then would not exhibit the structure of the table, but instead would display the parameters of the enclosed algorithm. One example for this is the so-called mn-SOM .
Usually, data are not evenly distributed. Hence, some nodes grow much more than others. One way to cope with this situation is to automatically let the SOM grow. many different variants of growth could be thought of and some already has been implemented. Our experiments point into the direction of a “stem cell” analog.
Growing SOMs have first been proposed by , while  provides some exploratory implementation. Concerning growing SOM, it is very important to understand the concept (or phenomenon) of growth. We will discuss possible growth patterns and the consequences for possible and new growing SOMs elsewhere. Just for now we can say that any kind of SOM structure can grow and/or differentiate.
SOM gas, mobile nodes
The name already says it: the topology of the grid making up the SOM is not fixed. Nodes even may range around quite freely, as in the case of SOM Gas.
Starting from mobile nodes, we can think about a small set of properties of nodes which are not directly given by the data structure. These properties can be interpreted as chemicals creating sth. like a Gray-Scott Reaction-Diffusion-Model, i.e. a self-organizing fluid dynamics. The possible effects are (i) a differential opacity of the SOM for transmitted information, or (ii) the differentiation into fibers and networks, or (iii) the optimization of the topological structure as a standard part of the life cycle of the SOM. The mobility can be controlled internally by means of a “temperature,” or expressed by a metaphor, the fixed SOM would melt partially. This may help to reorganize a SOM. In scientific literature, there is yet no reference for this extension of the SOM.
Evolutionary Embedded SOM with Meta-Modeling
SOM can be embedded into an evolutionary learning about the most appropriate selection of attributes. This can be extended even towards the construction of structural hypothesis about the data. While other methods could be also embedded in a similar manner, the results are drastically different, since most methods do not learn structurally. Coupling evolutionary processes with associative structures was proposed a long time ago by , albeit only in the context of optimization of ANN. While this is quite reasonable, we additionally propose to use evolution in a different manner and for different purposes (see the chapter about evolution)
 ART networks
 merging splitting of nodes
 The mn-SOM
 Growing SOM a
 Growing SOM b
 evolutionary optimization of artificial neural networks