SOM = Speedily Organizing Map
February 12, 2012
The Self-organizing Map is a powerful and high-potential computational procedure.
Yet, there is no free lunch, especially not for procedures that are able to deliver meaningful results.
The self-organizing map is such a valuable procedure that we have discussed its theoretical potential with regard to a range of different aspects in other chapters. Here, we do not want to deal further with such theoretical or even philosophical issues, e.g. those related to the philosophy of mind. Instead, we focus on the issue of performance, understood simply in terms of speed.
For all those demo SOMs, algorithmic time complexity is not really an issue; the algorithm approximates a stable state rather quickly. Yet small maps, where "small" means something like fewer than 500 nodes, are not really interesting. It is much like in brains. Brains are made largely from neurons and some chemicals, and a lot of them can do amazing things. With 500 of them you may equip a worm in an appropriate manner, but not even a termite. The important questions, beyond the nice story about theoretical benefits, are thus:
What happens with the SOM principle if we connect 1’000’000 nodes?
How to organize 100, 1’000 or 10’000 of such million-node SOMs?
With these figures we would end up with somewhere around 1..10 billion nodes [1], all organized along the same principles. Just to avoid a common misunderstanding here: these masses of neurons are organized in a very similar manner, yet their totality builds a complex system as we have described it in our chapter about complexity. There are several, if not many, emergent levels, and a lot of self-referentiality. These billion nodes are not all engaged with segmenting external data! We will see elsewhere, in the chapter about associative storage and memory, how such a deeply integrated modular system could be conceived of. There are some steps to take, though not terribly complicated or difficult ones.
When approaching such scales, the advantage of self-organization turns palpably into a problematic disadvantage. "Self-organizing" means "bottom-up," and this bottom-up direction is represented in SOMs by the fact that all records representing the observations have to be compared repeatedly to all nodes in the SOM in order to find the so-called "best matching unit" (BMU). The BMU is just that node in the network whose intensional profile is the most similar one among all profiles [2]. Though the SOM avoids comparing all records to all records, its algorithmic complexity scales as a power function of its own size! Normally, the cost of an algorithm depends on the size of the data, but not on a power of the algorithm's own structures.
In its naive form the SOM shows a complexity of something like O(n · w · m²), where n is the amount of data (number of records, size of the feature set), w the number of nodes to be visited when searching for the BMU, and m² the number of nodes affected by the update procedure. w and m are scaled by factors f1, f2 < 1, but the basic complexity remains. The update procedure affects an area that depends on the size of the SOM itself, hence the exponent. The exact degree of algorithmic complexity is not absolutely determined, since it depends, among other things, on the dynamics of the learning function.
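To make these cost terms concrete, here is a minimal, purely illustrative sketch of one naive training pass; the map size, learning rate, and neighborhood radius are invented for the example and not taken from the text:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_pass(data, nodes, side, lr=0.5, radius=1):
    """One naive pass: for every record, scan ALL nodes for the BMU
    (the w term), then update a neighborhood patch (the m^2 term)."""
    for rec in data:                                        # n records
        bmu = min(range(len(nodes)), key=lambda i: dist2(rec, nodes[i]))
        bx, by = bmu % side, bmu // side
        for i in range(len(nodes)):                         # neighborhood update
            x, y = i % side, i // side
            if abs(x - bx) <= radius and abs(y - by) <= radius:
                nodes[i] = [p + lr * (r - p) for p, r in zip(nodes[i], rec)]
    return nodes

# Tiny demo: a 4x4 map with 3-dimensional profiles.
rng = random.Random(0)
rec = [rng.random() for _ in range(3)]
nodes = [[rng.random() for _ in range(3)] for _ in range(16)]
d_before = min(dist2(rec, nd) for nd in nodes)
nodes = train_pass([rec] * 3, nodes, side=4)
d_after = min(dist2(rec, nd) for nd in nodes)  # the BMU has moved toward rec
```

Both the BMU scan and the neighborhood update visit node counts that grow with the map itself, which is exactly why the cost scales with the SOM's own size.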
The situation worsens significantly if we apply improvements to the original flavor of the SOM, e.g.:
- the principle of homogenized variance (per variable across extensional containers),
- in targeted modeling, tracking the explicit classification performance per node on the level of records, which means that the data has to be referenced,
- size balancing of nodes,
- morphological differentiation such as growing and melting, as in the FluidSOM, which additionally allows for free-ranging nodes,
- evolutionary feature selection and the creation of proto-hypotheses,
- dynamic transformation of data,
- and, finally, the problem of the indeterminacy of empirical data, which enforces differential modeling, i.e. a style of modeling that experimentally investigates the dependency of the results on the settings (the free parameters) of the computational procedure: sampling the data, choosing the target, selecting a resolution for the scale of the resulting classification, choosing a risk attitude, among several more.
All of this affects the results of modeling, that is, the prognostic/diagnostic/semantic conclusions one could draw from it. Albeit all these steps could be organized by a set of rules, including the application of another instance of a SOM, and thus could be run automatically, all of these necessary explorations require separate modeling. It is quite easy to set up an exploration plan for differential modeling that requires several dozen models, and if evolutionary optimization is applied, hundreds if not thousands of different maps have to be calculated.
Fortunately, the SOM offers a range of opportunities for using dynamic look-up tables and parallel processing. A SOM consisting of 1’000’000 neurons could easily utilize several thousand threads, without much worry about concurrency (or collisions between parallel threads). Unfortunately, such computers are not available yet, but you get the message…
Meanwhile we have to apply optimization through dynamically built look-up tables. These I will describe briefly in the following sections.
Searching the Most Similar Node
An integral part of speeding up the SOM in real-life applications is an appropriate initialization of the intensional profiles across the map. Of course, precisely this cannot be known in advance, at least not exactly. The self-organization of the map is the shortest path to its final state; there is no analytic short-cut. Kohonen proposes to apply Principal Component Analysis (PCA) for calculating the initial values. I am convinced that this is not a good idea. The PCA is deeply incompatible with the SOM, hence it responds to very different traits in the data. PCA and SOM behave similarly only in demo cases…
Preselecting the Landing Zone
A better alternative is the SOM itself. Since the mapping organized by the SOM preserves the topology of the data, we can apply a much smaller SOM, or even a nested series of down-scaled SOMs, to create a coarse model for selecting the appropriate sub-population of the large SOM. The steps are the following:
1. Create the main SOM, say 40’000 nodes, organized on a square grid with sides of 200 nodes each.
2. Create a SOM for preselecting the landing zone, scaled to approximately 14 by 14 nodes, and use the same structure (i.e. the same feature vectors) as for the large SOM.
3. Prime the small SOM with a small but significant sample of the data, say 2’000..4’000 records for its roughly 200 nodes; draw this sample randomly from the data. This step completes comparatively quickly (faster by a factor of about 200 in our example).
4. Initialize the main SOM by a blurred (geometric) projection of the intensional profiles from the minor to the larger SOM.
5. Now use the minor SOM as a model for the selection of the landing zone, simply by means of geometric projection.
As a result, the number of nodes to be visited in the large SOM in order to find the best match remains almost constant.
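The preselection can be sketched roughly as follows; the grid sizes, the scaling arithmetic, and the margin of the scanned patch are assumptions for illustration (and the maps here are filled with random profiles rather than trained ones):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bmu(rec, grid):
    """Full scan: grid maps (x, y) positions to intensional profiles."""
    return min(grid, key=lambda pos: dist2(rec, grid[pos]))

def bmu_via_landing_zone(rec, small, big, small_side, big_side, margin=2):
    """Find the BMU in the small map, project its position geometrically
    onto the big map, and scan only the patch around that landing zone."""
    sx, sy = bmu(rec, small)
    scale = big_side / small_side
    cx, cy = int(sx * scale), int(sy * scale)
    reach = margin + scale
    patch = {p: big[p] for p in big
             if abs(p[0] - cx) <= reach and abs(p[1] - cy) <= reach}
    return bmu(rec, patch)   # near-constant number of nodes visited

# Toy maps: a 4x4 preselection SOM and a 20x20 main SOM.
rng = random.Random(1)
small = {(x, y): [rng.random() for _ in range(4)] for x in range(4) for y in range(4)}
big = {(x, y): [rng.random() for _ in range(4)] for x in range(20) for y in range(20)}
rec = [rng.random() for _ in range(4)]
pos = bmu_via_landing_zone(rec, small, big, small_side=4, big_side=20)
```

The patch size depends only on the margin and the scale ratio, not on the size of the main map, which is what keeps the search cost nearly constant.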
There is an interesting correlate to this technique. If one needs a series of SOM-based representations of the data, distinguished just by the maximum number of nodes in the map, one should always start with the lowest, i.e. most coarse, resolution, that is, with the least number of nodes. The results can then be used as a projective priming of the SOM on the next level of resolution.
Explicit Lookup Tables linking Observations to SOM areas
In the previous section we described the usage of a secondary, much smaller SOM as a device for limiting the number of nodes to be scanned. The same problem can be addressed by explicit lookup tables that link a given record to a vicinity around its last (few) best matching units.
If the SOM is approximately stable, that is, after the SOM has seen a significant portion of the data, it is no longer necessary to check the whole map; just scan the vicinity around the last best matching node. Again, the number of nodes that need to be checked is almost constant.
The stability cannot be predicted in advance, of course. The SOM is, as the name says, self-organizing (albeit in a weak manner). As a rule of thumb, one could check the average number of observations attached to a particular node, the average being taken across all nodes that contain at least one record. This average filling should be larger than 8..10 (due to considerations based on variance, and some arguments derived from non-parametric statistics… but it is a rule of thumb).
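A minimal sketch of such a lookup table; the record id and the vicinity radius are invented for the example:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

last_bmu = {}  # record id -> (x, y) of the previously found BMU

def bmu_cached(record_id, rec, grid, radius=2):
    """Scan only a small vicinity around the record's last BMU; fall
    back to a full scan the first time a record is seen."""
    if record_id in last_bmu:
        cx, cy = last_bmu[record_id]
        candidates = {p: grid[p] for p in grid
                      if abs(p[0] - cx) <= radius and abs(p[1] - cy) <= radius}
    else:
        candidates = grid
    pos = min(candidates, key=lambda p: dist2(rec, candidates[p]))
    last_bmu[record_id] = pos
    return pos

# Demo on a random 6x6 map.
rng = random.Random(2)
grid = {(x, y): [rng.random() for _ in range(3)] for x in range(6) for y in range(6)}
rec = [rng.random() for _ in range(3)]
first = bmu_cached("r1", rec, grid)    # full scan
second = bmu_cached("r1", rec, grid)   # scans only the 5x5 vicinity of `first`
```

On a stable map the vicinity scan finds the same node as the full scan, at a fraction of the cost.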
Large Feature Vectors
Feature vectors can be very large. In the life sciences and medicine I have experienced cases with 3000 raw variables. During data transformation this number can increase to 4000..6000 variables. The comparison of such feature vectors is quite expensive.
Fortunately, there are some nice tricks, which are all based on the same strategy. This strategy comprises the following steps.
1. Create a temporary SOM with a very different feature vector: it has just around 80..100 (real-valued) positions plus 1 position for the index variable (in other words, the table key); thus the size of this vector is about a 60th of the original vector if we are faced with 6000 variables.
2. Create the derived vectors by encoding the records representing the observations with a technique called "random projection"; such a projection is generated by multiplying the data vector with a token from a catalog of (labeled) matrices that are filled with uniform random numbers ([0..1]).
3. Create the "random projection" SOM based on these transformed records.
4. After training, replace the random projection data with the real data, re-calculate the intensional profiles accordingly, and run a small sample of the data through the SOM for final tuning.
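The encoding in step 2 can be sketched as follows; the dimensions (600 raw variables projected down to 100) and the fixed seed standing in for a "labeled" matrix from the catalog are illustrative assumptions:

```python
import random

def projection_matrix(raw_dim, proj_dim, seed=42):
    """One 'token' from the catalog: a fixed matrix of uniform random
    numbers in [0..1); the seed plays the role of its label."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(raw_dim)] for _ in range(proj_dim)]

def project(rec, matrix):
    """Multiply the raw data vector with the random matrix, yielding a
    much shorter derived vector for the temporary SOM."""
    return [sum(m * x for m, x in zip(row, rec)) for row in matrix]

# 600 raw variables compressed to 100 positions.
M = projection_matrix(raw_dim=600, proj_dim=100)
raw = [0.5] * 600            # stand-in for one observation record
compressed = project(raw, M)
```

Because the matrix is fixed (the seed is its label), the same record always yields the same derived vector, which is what makes the later swap back to the real data in step 4 possible.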
The technique of random projection was invented in 1988. The principle works because of two facts:
1. Quite amazingly, all random vectors beyond a certain dimensionality (80..200, as said before) are nearly orthogonal to each other. The random projection thus compresses the original data without losing the distinctive bits of information, even if they are no longer accessible in an explicit manner.
2. The only trait of the data that the SOM considers is their potential difference.
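Fact (1) is easy to check numerically. One assumption in this sketch: the coordinates are centered around zero before measuring angles, since vectors drawn uniformly from [0..1] all point into the positive orthant and would not look orthogonal without centering:

```python
import math
import random

def cos_angle(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Ten random 200-dimensional vectors, centered to [-0.5..0.5].
rng = random.Random(0)
dim = 200
vecs = [[rng.random() - 0.5 for _ in range(dim)] for _ in range(10)]

# Average |cos| over all 45 pairs: close to 0, i.e. nearly orthogonal.
pairs = [abs(cos_angle(vecs[i], vecs[j]))
         for i in range(10) for j in range(i + 1, 10)]
avg = sum(pairs) / len(pairs)
```

At 200 dimensions the average |cos| stays well below 0.1, and it shrinks further as the dimensionality grows.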
Bags of SOMs
Finally, one can take advantage of splitting the data into several smaller samples. These samples require only smaller SOMs, which train much faster (we are faced with a power law). After training, the individual SOMs can be combined into a compound model.
This technique is known as bagging in Data Mining. Today it is also quite popular in the form of so-called random forests, where instead of one large decision tree many smaller ones are built and then combined. This technique is very promising, since it is a technique of nature: it is simply modularization on an abstract level, leading to the next level of abstraction in a seamless manner. It is also one of our favorite principles for the overall "epistemic machine".
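A rough sketch of the bagging scheme; note two assumptions: the per-bag "training" is reduced to a trivial stand-in (nodes adopt randomly drawn sample records), and the combination rule of querying the nearest profile across all bags is mine, since the compound model is left unspecified above:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_small_map(sample, n_nodes, seed):
    """Stand-in for training one small SOM on one data sample."""
    rng = random.Random(seed)
    return [list(rng.choice(sample)) for _ in range(n_nodes)]

def bagged_maps(data, k=4, n_nodes=8):
    """Split the data into k samples and fit one small map per sample."""
    chunk = len(data) // k
    return [fit_small_map(data[i * chunk:(i + 1) * chunk], n_nodes, seed=i)
            for i in range(k)]

def best_profile(rec, maps):
    """Compound model: nearest intensional profile across all bags."""
    return min((p for m in maps for p in m), key=lambda p: dist2(rec, p))

# 40 records with 3 variables, split into 4 bags of 8 nodes each.
rng = random.Random(3)
data = [[rng.random() for _ in range(3)] for _ in range(40)]
maps = bagged_maps(data)
winner = best_profile(data[0], maps)
```

Each bag trains on a quarter of the data, so by the power law discussed earlier the total training cost drops well below that of one large map.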
1. This would represent just around 10% of the neurons of our brain, if we interpret each node as a neuron. Yet this comparison is misleading. The functionality of a node in a SOM rather represents a whole population of neurons, although no 1:1 principle is transferable between them. Hence, such a system would be roughly of the size of a human brain and, more importantly, it would likely be organized in a comparable, albeit rather alien, manner.
2. Quite often, the vectors attached to the nodes are called weight vectors. This is a serious misnomer, as this vector weights neither the nodes (alone) nor the variables that make it up (for more details see here). Conceptually it is much more sound to call those vectors "intensional profiles." Actually, one could indeed imagine a weight vector that controls the contribution ("weight") of variables to the similarity / distance between two such vectors. Such weight vectors could even be specific to each of the nodes.