December 19, 2011
Initially, the meaning of ‘associativity’ seems to be pretty clear.
According to common sense, it denotes the capacity or the power to associate entities, to establish a relation or a link between them. Yet, there is a different meaning from mathematics that almost seems to mock common sense. Because these meanings diverge so strongly, we first have to clarify our usage before discussing the concept.
A Strange Case…
In mathematics, associativity is defined as a neutrality of the results of a compound operation with respect to the “bundling,” or association, of the individual parts of the operation. The formal statement is:
A binary operation ∘ (relating two arguments) on a set S is called associative if it satisfies the associative law:
x∘(y∘z) = (x∘y)∘z for all x, y, z ∊ S
This, however, is just the opposite of “associative,” as it demands independence from any particular association. If there were any capacity to establish an association between any two elements of S, then there should not be any difference.
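The associative law stated above can be checked by brute force on a small finite set. The following is a minimal sketch, not part of the original argument; the function name `is_associative` and the two sample operations (addition and subtraction modulo 5) are assumptions chosen purely for illustration.

```python
from itertools import product

def is_associative(op, elements):
    """Return True if op(x, op(y, z)) == op(op(x, y), z) for all x, y, z."""
    return all(op(x, op(y, z)) == op(op(x, y), z)
               for x, y, z in product(elements, repeat=3))

S = range(5)  # a small finite set, hypothetical example

def add_mod5(x, y):
    return (x + y) % 5  # the bundling of operands does not matter

def sub_mod5(x, y):
    return (x - y) % 5  # the bundling of operands does matter

print(is_associative(add_mod5, S))
print(is_associative(sub_mod5, S))
```

Addition modulo 5 passes the check, subtraction modulo 5 fails it: the law only says that the result is indifferent to the bundling, which is exactly the "inverted" meaning discussed here.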
Maybe, some mathematician in the 19th century hated the associative power of so many natural structures. Subsequently, modernism contributed its own part to establish the corruption of the obvious etymological roots.
In mathematics the notion of associativity—let us call it I-associativity in order to indicate the inverted meaning—is an important part of fundamental structures like “classic” (Abelian) groups or categories.
Groups are important since they describe the basic symmetries within the “group” of operations that together form an algebra. Groups cover anything that could be done with sets. Note that the central property of sets is their enumerability. (Hence, a notion of “infinite” sets is nonsense; it simply contradicts itself.) Yet, there are examples of quite successful, say: abundantly used, structures that are not based on I-associativity, the most famous of them being the Lie algebra, whose bracket obeys the Jacobi identity instead of the associative law. (The Lie group it belongs to remains I-associative, like any group.) Lie groups allow us to conceive of continuous symmetry, and in this respect they are much more general than the Abelian group that essentially emerged from the generalization of geometry. Even in the case of Lie algebras or other “non-associative” structures, however, the term is used in its inverted meaning.
With respect to categories we can say that so far, and quite unfortunately, there is not yet something like a category theory that would not rely on I-associativity, a fact that is quite telling in itself. Of course, category theory is also quite successful, yet…
Well, anyway, we would like to indicate that we are not dealing with I-associativity here in this chapter. In contrast, we are interested in the phenomenon of associativity as it is indicated by the etymological roots of the word: The power to establish relations.
A Blind Spot…
In some way the particular horror creationes so abundant in mathematics is comprehensible. If a system started to establish relations, it would also establish novelty by means of that relation (something that simply did not exist before). So far, it has not been possible for mathematics to deal symbolically with the phenomenon of novelty.
Nevertheless it is astonishing that a Google raid on the term “associativity” reveals only slightly more than 500 links (Dec. 2011), the vast majority of which simply consist of the spoofed entry in Wikipedia that considers the mathematical notion of I-associativity. Some other links are related to computer science and basically refer to the same issue, just sailing under a different flag. Remarkably, only one single link, from an open source robotics project [1], mentions associativity as we will do here.
Not very surprisingly, one can find an intense linkage between “associative” and “memory,” though not in the absolute number of sources (also around 600), but in the number of citations. According to Google Scholar, Kohonen and his Self-Organizing Map [2] is cited 9000+ times, followed by Anderson’s account of human memory [3], which has accumulated 2700 citations.
Of course, there are many entries on the web referring to the word “associative,” which, however, is an adjective. Our impression is that the capability to associate has not made its way into a more formal consideration, and that it has not even been regarded as a capability that deserves a dedicated investigation. This deficit may well be considered a continuation of a much older story of a closely related neglect, namely that of the relation, as Mertz pointed out [4, ch.6], since associativity is just the dynamic counterpart of the relation.
Formal and (Quasi-)Material Aspects
In a first attempt, we could conceive of associativity as the capability to impose new relations between some entities. For Hume (in his “Treatise”, see Deleuze’s book about him), association was close to what Kant later dubbed “faculty”: the power to do something, in this case to relate ideas. However, such wording is inappropriate as we have seen (or: will see) in the chapters about modeling and categories and models. Speaking about relations and entities implies set theory, yet models and modeling can’t be covered by set theory, or only very exceptionally so. Since category theory seems to match the requirements and the structure of models much better, we also adopt its structure and its wording.
Associativity then may be taken as the capability to impose arrows between objects A, B, C such that at least A ⊆ B ⊆ C, but usually A ⋐ B ⋐ C, and furthermore A ≃ C, where “≃” means “taken to be identical despite non-identity”. In set theoretic terms we would have used the notion of the equivalence class. Such arrows may be identified with the generalized model, as we are arguing in the chapter about the category of models. The symbolized notion of the generalized abstract model looks like this (for details jump over to the page about modeling):
Those arrows representing the (instances of a generalized) model are functors that are mediating between categories. We also may say that the model imposes potentially a manifold of partially ordered sets (posets) onto the initial collection of objects.
Now we can start to address our target, the structural aspects of associativity, more directly. We are interested in the necessary and sufficient conditions for establishing an instance of an object that is able (or develops the capability) to associate objects in the aforementioned sense. In other words, we need an abstract model for it. Yet, here we are not interested in the basic, that is, transcendental conditions for the capability to build up associative power.
Let us start more practically, though still immaterially. The best candidates we can think of are Self-Organizing Maps (SOM) and particularly parameterized Reaction-Diffusion Systems (RDS); both of them can be subsumed into the class of associative probabilistic networks, which we describe in another chapter in more technical detail. Of course, not all networks exhibit the emergent property of associativity. We may roughly distinguish between associative networks and logistic networks [5]. Both SOM and RDS are also able to create manifolds of partial orderings. Another example from this family is the Boltzmann engine, which, however, has some important theoretical and practical drawbacks, even in its generalized form.
Next, we depict the elementary processes of SOM and RDS, respectively. SOM and RDS can be seen as instances located at the distant endpoints of a particular scale, which expresses the topology of the network. The topology expresses the arrangement of quasi-material entities that serve as persistent structure, i.e. as a kind of memory. In the SOM, these entities are called nodes and they are positioned in a more or less fixed grid (albeit there is a variant of the SOM, the SOM gas, where the grid is more fluid). The nodes do not range around. In contrast to the SOM, the entities of an RDS are freely floating around. Yet, RDS are simulated much like the SOM, assuming cells in a grid and stuffing them with a certain memory.
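To make the phrase "cells in a grid stuffed with a certain memory" more tangible, here is a minimal sketch of one update step of a reaction-diffusion system. The text does not name a particular RDS; the Gray-Scott model, the parameter values (`du`, `dv`, `feed`, `kill`, `dt`) and all function names are assumptions chosen only for illustration. Each cell remembers two concentrations, `u` and `v`, and is updated from its four nearest neighbors on a wrap-around grid.

```python
def laplacian(field, r, c):
    """Discrete Laplacian at cell (r, c) using four neighbors and wrap-around borders."""
    n_rows, n_cols = len(field), len(field[0])
    return (field[(r - 1) % n_rows][c] + field[(r + 1) % n_rows][c]
            + field[r][(c - 1) % n_cols] + field[r][(c + 1) % n_cols]
            - 4.0 * field[r][c])

def step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.065, dt=1.0):
    """One explicit Euler step of the Gray-Scott equations (illustrative parameters)."""
    n_rows, n_cols = len(u), len(u[0])
    new_u = [row[:] for row in u]
    new_v = [row[:] for row in v]
    for r in range(n_rows):
        for c in range(n_cols):
            reaction = u[r][c] * v[r][c] ** 2  # local, nonlinear coupling of u and v
            new_u[r][c] = u[r][c] + dt * (
                du * laplacian(u, r, c) - reaction + feed * (1.0 - u[r][c]))
            new_v[r][c] = v[r][c] + dt * (
                dv * laplacian(v, r, c) + reaction - (feed + kill) * v[r][c])
    return new_u, new_v

# A uniform grid with a single seeded cell: the disturbance spreads to its neighbors.
size = 9
u = [[1.0] * size for _ in range(size)]
v = [[0.0] * size for _ in range(size)]
v[4][4] = 0.5
u1, v1 = step(u, v)
print(v1[4][5])
```

The only persistent structure here is the pair of numbers per cell; everything else is the repeated application of the same local rule, which is why such a simulation looks so similar to a SOM in code even though the entities of an RDS are thought of as floating freely.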
Inspecting those elementary processes, we of course again find transformations. More important, however, is another structural property common to both of them. Both networks are characterized by a dynamically changing field of (attractive) forces. Only the locality of those forces differs between SOM and RDS, leading to a greater degree of parallelism in RDS and to multiple areas of the same quality. In SOMs, each node is unique.
The forces in both types of networks, however, exhibit the property of locality, i.e. there are one or more centers, where the force is strong, and a neighborhood that is established through a stochastic decay of the strength of this force. Usually, in SOM as well as in RDS, the decay is assumed to be radially symmetric, but this is not a necessary condition.
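The idea of a center with a decaying neighborhood can be sketched as a small function. This is an assumption-laden illustration, not the formulation used in the text: the decay is modeled as a deterministic Gaussian (a stochastic decay would add noise on top), and the names `neighborhood`, `center` and `sigma` are hypothetical.

```python
import math

def neighborhood(rows, cols, center, sigma):
    """Force strength for every grid cell, decaying radially with distance from center."""
    cr, cc = center
    return [[math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2.0 * sigma ** 2))
             for c in range(cols)]
            for r in range(rows)]

field = neighborhood(7, 7, center=(3, 3), sigma=1.5)
for row in field:
    print(" ".join(f"{value:.2f}" for value in row))
```

The strength is 1 at the center and falls off equally in all directions. A non-symmetric neighborhood, which the text explicitly allows, could be obtained by using different widths along rows and columns.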
After all, are we now allowed to ask ‘Where does this associativity come from?’ The answer is clearly ‘no.’ Associativity is a holistic property of the arrangement as a whole. It is the result of the copresence of some properties like
- stochastic neighborhoods that are hosting an anisotropic and monotone field of forces;
- a certain, small memory capacity of the nodes; note that the nodes are not “points”: in order to have a memory they need some corporeality. In turn this opens the way to think of a separation of the function of that memory and a variable host that provides a container for that memory;
- strong flows, i.e. a large number of elementary operations acting on that memory, producing excitatory waves (long-range correlations) of finite velocity.
The result of the interaction of those properties cannot be described on the level of the elements of the network itself, or any of its parts. What we will observe is a complex dynamics of patterns due to the superposition of antagonistic forces that are modeled either explicitly in the case of RDS, or more implicitly in the case of SOM. Thus both networks also exhibit the property of self-organization, though this aspect is expressed much more dominantly in RDS than in the SOM. The important issue is that the whole network, and even more important, the network and its local persistence (“memory”) “causes” the higher-level phenomenon.
We also could say that it is the quasi-material body that is responsible for the associativity of the arrangement.
The Power of a Capability
So, what is this associativity thing about? As we have said above, associativity imposes a potential manifold of partial orderings upon an arbitrary open set.
Take a mixed herd of Gnus and Zebras as the open set without any particular ordering, put some predators like hyenas or lions into this herd, and you will get multiple partially ordered sub-populations. In this case, the associativity emerges through particular rules of defense, attack and differential movement. The result of the process is a particular probabilistic order, clearly an immaterial aspect of the herd, despite the fact that we are dealing with fleshy animals.
The interesting thing in both the SOM and the RDS is that a quasi-body provides a capability that transforms an immaterial arrangement. The resulting immaterial arrangement is nothing else than information. In other words, something specific, namely a persistent contrast, has been established from something larger and unspecific, i.e. noise. Taking the perspective of the results, i.e. with respect to the resulting information, we can always see that the association creates new information. The body, i.e. the materially encoded filters and rules, has a greater weight in RDS, while in the case of the SOM the stabilization aspect is more dominant. In any case, the associative quasi-body introduces breaks of symmetry, establishes them and stabilizes them. If this symmetry breaking is aligned to some influences, feedback or reinforcement acting from the surroundings onto the quasi-body, we may well call the whole process (a simple form of) “learning.”
Yet, this change in the informational setup of the whole “system” is mirrored by a material change in the underlying quasi-body. Associative quasi-bodies are therefore representatives for the transition from the material to the immaterial, or in more popular terms, for the body-mind-dualism. As we have seen, there is no conflict between those categories, as the quasi-body showing associativity provides a double-articulating substrate for differences. Moreover, we can see that these differences are transformed from a horizontal difference (such as 7-5=2) into vertical, categorical differences (such as the differential). If we would like to compare those vertical differences we need … category theory! …or a philosophy of the differential!
Early in the 20th century, the concept of association was adopted by behaviorism. Simply recall the dog of Pavlov and the experiments of Skinner and Watson. The key term in behaviorism as a belated echo of 17th century hyper-mechanistics (support of a strictly mechanical world view) is conditioning, which appears in various forms. Yet, conditioning always remains a 2-valued relation, practically achieved as an imprinting, a collision between two inanimate entities, despite the wording of behaviorists who equate their conditioning with “learning by association.” What should learning be otherwise? Nevertheless, behaviorist theory commits the mistake of thinking that this “learning” should be a passive act. As you can see here, psychologists still strongly believe in this weird concept. They write: “Note that it does not depend on us doing anything.” Utter nonsense, nothing else.
In contrast to imprinting, imposing a functor onto an open set of indeterminate objects is not only an exhausting activity, it is also a multi-valued “relation,” or simply, a category. If we analyzed the process of imprinting, we would find that even “imprinting” can’t be covered by a 2-valued relation.
Nevertheless, other people took the media as the message. For instance, Steven Pinker criticized the view that association is sufficient to explain the capability of language. Doing so, he commits the same mistake as the behaviorists, just from the opposite direction. How else should we acquire language, if not by some kind of learning, even if it is a particular type of learning? The blind spot of Pinker seems to be randomization, i.e. he is not able to leave the actual representation of a “signal” behind.
Another field of application for the concept of associativity is urban planning or urbanism, although associativity is rarely recognized as a conceptual or even as a design tool [cf. 6]. It is obvious that urban environments can be conceived as a multitude of high-dimensional probabilistic networks [7].
Machines, Machines, Machines, ….Machines?
Associativity is a property of a persistent (quasi-material) arrangement to act onto a volatile stream (e.g. information, entropy) in such a way as to establish a particular immaterial arrangement (the pattern, or association), which in turn is reflected by material properties of the persistent layer. Equivalently we may say that the process leading to an association is encoded into the material arrangement itself. The establishment of the first pattern is the work of the (quasi-)body. Only for this reason it is possible to build associative formal structures like the SOM or the RDS.
Yet, the notion of “machine” would be misplaced. We observe strict determinism only on the level of the elementary micro-processes. Each of the vast number of individual micro-events is indeed uniquely parameterized, sharing with the others only the same principle or structure. In such cases we cannot speak of a single machine any more, since a mechanical machine has a singular and identifiable state at any point in time. The concept of “state” holds neither for RDS nor for SOM. What we see here is much more like a vast population of similar machines, none of which is even stable across time. Instead, we need to adopt the concept of mechanism, as it is in use in chemistry, physiology, or biology at large. Since both principles, SOM and RDS, show the phenomenon of self-organization, we cannot even say that they represent a probabilistic machine. The notion of the “machine” can’t be applied to SOM or RDS, despite the fact that we can write down the principles for the micro-level in simple and analytic formulas. Yet, we can’t assume any kind of mechanics for the interaction of those micro-machines.
It is now exciting to see that a probabilistic, self-organizing process used to create a model by means of associating principles loses the property of being a machine, even as it is running on a completely deterministic machine, the simulation of a Universal Turing Machine.
Associativity is a principle that transcends the machine, and even the machinic (Guattari). Assortative arrangements establish persistent differences, hence we can say that they create proto-symbols. Without associativity there is no information. Of course, the inverse is also true: Wherever we find information or an assortment, we also must expect associativity.
- [1] iCub
- [2] Kohonen, Teuvo, Self-Organization and Associative Memory. Springer Series in Information Sciences, vol.8, Springer, New York 1988.
- [3] Anderson J.R., Bower G.H., Human Associative Memory. Erlbaum, Hillsdale (NJ) 1980.
- [4] Mertz, D. W., Moderate Realism and its Logic, New Haven: Yale 1996.
- [5] Wassermann, K. (2010), Associativity and Other Wurban Things – The Web and the Urban as merging Cultural Qualities. 1st international workshop on the urban internet of things, in conjunction with: internet of things conference 2010 in Tokyo, Japan, Nov 29 – Dec 1, 2010.
- [6] Dean, P., Rethinking representation. the Berlage Institute report No.11, episode Publ. 2007.
- [7] Wassermann, K. (2010). SOMcity: Networks, Probability, the City, and its Context. eCAADe 2010, Zürich. September 15-18, 2010.
November 11, 2011
We may put it simply, and (we are quite sure) everybody will agree upon that: Everything is moving, spinning, jumping, turning, winking, on any level, from the electrons to the galaxies, from molecules and plants to animals and humans. Yet, even the founders of philosophy, those demigods from classic Greece, got trapped by the idea (or, more appropriately, the ideology) of stasis, which in the beginning was the idea of the idea.
Throughout the history of culture there is a salient trace of that ideology. From said idea to Archimedes’ linchpin, from the silly idea of the earth as the centerpoint of the universe to the idea of the universal itself, or quite recently, to the idea of the state, which has been claimed as a proper concept for dealing with language and mind in thousands of publications. You find it in Hegel, everywhere there, not in Darwin or Nietzsche, again in the territorialism of Heidegger (should I say terrorialism?), but neither in Wittgenstein’s nor in Deleuze’s thought, whose whole oeuvres were directed strictly against any kind of stasis, territory, state or universal. Everything is flight, escape, series, event, and logic is transcendental, if there is such a thing as logic at all. Above all and beyond any other, of course, so-called analytic philosophy, particularly that which once originated in German culture (though near Vienna), is still a proponent of stasis, whether or not its adherents think about the mind (and the brain). Since it has been the program of modernism as a whole to expel time from the world view, it does not come as a surprise that neuroscience as well as computer science forgot about the movement.
But it does move. What? Everything, not just the earth as Galileo was so eager to popularize. Actually, we should not even first put an object there (or a species, a gene, an idea) and only then, as a second step, ask how it came about. We know quite well about the worries of Zeno and the limitations of Newton’s physics. It is indeed a radical move to put movement and change as the primary entity. It was radical in physics and biology, and it will be even more radical in “soft” sciences like linguistics or cognitive psychology, even in philosophy.
The question at the core of understanding the “world” therefore is about the transition from the moving, the vortices, the clinamen, the indeterminate to their territorial counterpart, the object, the symbol, the word, the proposition. It would be too bold to call them (semi-)illusions, probably; yet, one could find quite some arguments to support that.
Of course, throughout history there always have been people emphasizing the primacy of the open transition, Lucretius, Ovid, Whitehead (but not Marx, of course), Serres. They do not, however, form any part of the mainstream of contemporary philosophy.
So, here still as a (well-founded) suggestion, we could say, that van Fraassen’s question is upside down in the same way as Minsky’s or Clancey’s “frame of reference.” We should NOT ask about how words acquire reference, but instead about how the sheafed stream of references exudes and secretes words. In the beginning there is not the word, in the beginning there is just the (associativity of) bodies.
To put it still more exactly, we should not ask about the applicability of logic in the world, but instead about the transition from the probabilistic to the propositional. This holds even for categories like category or relations. If one took category theory with categories as the quasi-objects, nothing would be gained. What is nice about category theory is its abstractness, yet the arrows (“transitions”) have to be randomized, or represented as probabilistic functions, similar to Dirac’s Delta: the probability density can’t be a well-defined one, there is a cascaded, higher-order indeterminacy.
The transition from the probabilistic to the propositional is basically a movement, since it involves bodiliness. It would be a mistake to conceive of that transition as a purely formal one. In an important sense it is also a (deep) synthesis, a construction. Note that this holds for any perception, on whatsoever level you’d like to choose. For very similar reasons, Putnam called \ˌa-nə-ˈli-tik\ (pronunciation of “analytic”) an inexplicable noisy sound [1].
There is a wealth of corollaries here, which we can’t dig into. Yet, it is very clear to us that this transition is near a transcendental category, probably even before space and time. As such, it is also one of the primary architectural (though quite abstract) principles for our undertaking of a machine-based epistemology here. An instance of this transition can be found in the relationship of information and causality.
One of those corollaries is represented by a whole cluster of the ill-posed questions about the mind-body-problem. We would not deny that there is an important difference (now opposing cognitivism, the computational theory of mind, modern neuroscience, etc.), and that it is important to think about that difference. Yet it is not a problem. The concept that makes this problem disappear is exactly the formulation of the question about the transition from the probabilistic to the propositional. At the times of Descartes whose work paved the way for this pseudo-problem, everything had been conceived as mechanical machines. Information was unknown, computers not available, even a concept of probabilistic networks or computational structures endowed with associative power far beyond any intellectual reach.
The big question here is… well, actually it is not sooo big: how to transfer this insight into a software system. Concerning our population of glued SOMs the simple question is: How to feed them? In less “metaphorical” style (though it is not that metaphorical at all) we, as programmers, have the task to decide how we present “information” to the SOM and how we introduce that information into it.
Whatsoever the answer will be, the answer does not contain the “symbol.” We would be trapped by the “fallacy of the symbolic,” and concerning our reasoning we would commit a petitio principii: if we put symbols into the concept (or a body) in its very beginning, it is quite likely that we will not find any other thing than symbols thereafter (or destructive “secondary” chaos). It will not solve the problem where the symbols come from. Undeniably, however, we use digital computers and quite obviously also a symbol-based instruction coding system (“programming language”). How then to present information in a non-symbolic manner?
The answer is: by probabilization. We should not think that it is possible to present “facts” to the machine. You may remember the failure of the logic-oriented AI, the Edinburgh school and their Prolog initiative. Instead, we have to present “probable contexts.” Of course, we have to define the concept of context such that it becomes operable, and again we use symbols for that. But this can be accomplished in a manner compatible with the probabilistic perspective. Any observational act could be conceived of as an interpretation of certain more or less anisotropic and regular changes of energy density. Such a description is almost purely physical. We are definitely on the proto-symbolic, even on a proto-semiotic stage. Fluctuations of physical energy densities are perceived as differential intensities. This scheme is not an absolute one, though. “Physicality” is best conceived as a relative property. For instance, words may form a physical layer for a novel. This view has been developed and emphasized also by Bühlmann [2].
The key element here, though often overlooked, is “interpretation” and its structural quality. We need some habits, methods and theories to be able to interpret. As always, it is important to keep in mind that interpretation is not a formal act, since formal acts are simply rewritings of some graphemes into some others, obeying a certain fully defined space of allowed relations and transformations. We will discuss this issue in much more detail in the chapter about models and modeling.
In other words, probabilization of observable items, even of symbolic ones, means that we transform their digital symbolics “back” into a level which could be labeled as “proto-interpretive.” This back-transformation should neither be conceived as a kind of “particularization” nor as a kind of “atomization.” The former would assume a subsuming class, which does not exist on the proto-interpretive level, while the latter would propose a kind of independence between the almost “physical” aspects. Let us call it the level of “elements,” despite the fact that we do not mean that this level is “more elementary” in the sense of “more basic.” This again would induce the petitio principii of the class-fallacy.
The selection, the design and the arrangement of “elements” is based on habits and theories that are completely outside of the item or context at hand. Obviously we meet a circular relationship here. Yet, that’s not a surprise, we even have a word for it: culture. Ultimately, even the structure of representing identifiable items by their elements may be assumed to be unstable. In principle, we simply can’t know about the elements actually in use.
Yet, on the side of the receiving body (in our case the SOM, or the human brain, respectively) this means that there are certain observables (fully within the limit of theory-boundedness of any observing), which need to be taken as densities, not as symbols. The proto-symbolic phase of observation is hence homeomorphic to the space given by the super-position of body and numbers, or more precisely, the space opened by the associative power of particularly arranged matter. As said before, in the beginning there is neither the word nor even the sign. We may call it “impressions,” coming from the external world. Nevertheless, it remains fully acceptable for us that those “impressions”, forming into signs or words downstream of the perceptive processes, are also dependent, and mandatorily so, on some kind of theory. Clearly, a proper concept of theory needs to be developed here.
So we find three important elements for dealing with the question about the appropriate presentational level: probabilistic contexts, relativity of physicality, and elements as a precipitation of culture. Nice food, isn’t it?
The transition from the probabilistic to the propositional includes the genesis of labels, and later also of symbols, if the former are repeated and come to be used as abbreviations or abbreviating models. This transition thus is also the correct description of the problem of “symbol grounding,” on which there is so much babbling. It does not come as a big surprise that a combination of associative concepts and formal concepts is rated as being very promising for the further development of machine-based cognition [3]. Yet, we have to start with the associative part.
Note that for the transition from labels to symbols we need a community, hence mediality, both of which are outside of any body. If members of a community aggregate to a form that we then again call “body,” such a body is again on the lower, the “boiling”, levels of the overall system. We will meet this topic again in our discussion on complexity, and in the short piece about the strong limitations of swarms and their so-called “collective intelligence.”
Since we necessarily have to refer to certain kinds of bodies, we may be allowed to keep the notion of feeding. That feeding and herding (hoarding?) depends obviously on the inner mechanisms, on the anabolic metabolism of any of the individual SOMs. How should we conceive of the digestion processes that turn “stuff” into “words”? Taking the animal body as a kind of template, we can see that the body removes most of the form of the input-information, it establishes a deformation, before any macroscopic structure is going to be assembled. It is so-to-speak a SOM-on-Steroids that is able to propel us from the body and its world to the word and its logical body.
This article was first published 11/11/2011, last revision is from 28/12/2011
- [1] Hilary Putnam, The Meaning of “Meaning”. Minnesota Studies in the Philosophy of Science 7:131-193. 1975.
- [2] Vera Bühlmann, Inhabiting Media. Thesis, University of Basel, 2008.
- [3] Uta Priss, Associative and Formal Concepts. ICCS’02.
October 20, 2011
A Map that organizes itself:
Is it the paradise for navigators, or is it the hell?
Well, it depends, I would say. As a control freak, or a warfarer like Shannon in the early 1940s, you probably vote for the hell. And indeed, there are presumably only very few entities that have been as strongly neglected by information scientists as the self-organizing map. Of course, there are some reasons for that. The other type of navigator, the one probably enjoying the SOM, is more likely of the Odysseus type, or Baudolino, the hero in Umberto Eco’s novel of the same name.
More seriously, the self-organizing map (SOM) is a powerful and even today (2011) still underestimated structure, though it is meanwhile rapidly gaining in popularity. This chapter serves as a basic introduction to the SOM, along with a first discussion of the strengths and weaknesses of its original version. Today, there are many versions around, mainly in research; the most important ones I will briefly mention at the end. It should be clear that there are tons of articles on the web. Yet, most of them focus on the mathematics of the more original versions, but do not describe or discuss the architecture itself, or provide a suitable high-level interpretation of what is going on in a SOM. So, I will not repeat the mathematics; instead I will try to explain it for non-engineers as well, without using mathematical formulas. Actually, the mathematics is not the most serious thing in it anyway.
The SOM is a bundle comprising a mathematical structure and a particularly designed procedure for feeding multi-dimensional (multi-attribute) data into it, where the data are prepared as a table. The number of attributes can reach tens of thousands. Its purpose is to infer the best possible sorting of the data in a two- (or three-) dimensional grid. Yet, both preconditions, dimensionality and data as table, are not absolute and may be overcome by certain extensions to the original version of the SOM. The sorting process groups more similar records closer together. Thus we can say that a body of high-dimensional data (organized as records from a table) is mapped onto 2 dimensions, thereby enforcing a weighting of the properties used to describe (essentially: create) the records.
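For readers who prefer code to prose, the feeding and sorting procedure can be sketched in a few lines. This is a deliberately small illustration under stated assumptions, not the original formulation and not any particular library's API: the function names `train_som` and `best_match`, the linear decay of learning rate and neighborhood width, the Gaussian neighborhood, and the toy table of records are all hypothetical choices.

```python
import math
import random

def best_match(grid, record):
    """Grid position of the node whose weight vector is closest to the record."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(record, weights))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(records, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Sort table records onto a rows x cols grid (illustrative schedule and kernel)."""
    rng = random.Random(seed)
    dim = len(records[0])
    # Every node carries a small memory: one weight per attribute.
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(records)
    step = 0
    for _ in range(epochs):
        order = list(records)
        rng.shuffle(order)
        for record in order:
            progress = step / total_steps
            lr = lr0 * (1.0 - progress)                    # learning rate shrinks
            sigma = max(sigma0 * (1.0 - progress), 0.5)    # neighborhood shrinks
            br, bc = best_match(grid, record)
            for r in range(rows):
                for c in range(cols):
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    pull = lr * math.exp(-dist2 / (2.0 * sigma ** 2))
                    weights = grid[r][c]
                    for i in range(dim):
                        weights[i] += pull * (record[i] - weights[i])
            step += 1
    return grid

# A toy table: each record has three attributes scaled to [0, 1].
records = [
    [0.10, 0.10, 0.20], [0.15, 0.05, 0.10], [0.20, 0.10, 0.15],
    [0.90, 0.80, 0.90], [0.85, 0.90, 0.95], [0.80, 0.85, 0.90],
]
grid = train_som(records, rows=3, cols=3)
for record in records:
    print(record, "->", best_match(grid, record))
```

Each record is presented to the grid, the most similar node is looked up, and that node together with its grid neighbors is pulled toward the record. Repeating this is what gradually places similar records on nearby nodes.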
The SOM can be parametrized such that it is a very robust method for clustering data. The SOM exhibits an interesting duality, as it can be used for basic clustering as well as for target oriented predictive modeling. This duality opens interesting possibilities for realizing a pre-specific associative storage. The SOM is particularly interesting due to its structure and hence due to its extensibility, properties that other most methods do not share with the SOM. Though substantially different to other popular structures like Artificial Neural Networks, the SOM may be included into the family of connectionist models.
The development that finally led to the SOM started around 1973 in a computer science lab at the Helsinki university. It was Teuvo Kohonen who became aware of certain memory effects of correlation matrices. Until 1979, when he first published the principle of the Self-Organizing Map, he dedicatedly adopted principles known from the human brain. A few further papers followed, and a book about the subject in 1983. Then, the SOM wasn’t readily adopted for at least 15 years. Its direct competitor for acceptance, the Backpropagation Artificial Neural Network (B/ANN), was published in 1985, after neural networks had been rediscovered in physics, following investigations of spin glasses and certain memory effects there. Actually, the interest in simulating neural networks dates back to 1941, when von Neumann, Hebb, McCulloch, and also Pitts, among others, met at a conference on the topic.
For a long time the SOM wasn’t regarded as a “neural network,” and this has been considered a serious disadvantage. The first part of the diagnosis indeed was true: Kohonen never tried to simulate individual neurons, as was the goal for all simulations of ANN. The ANN research has been deeply informed by physics, cybernetics and mathematical information theory. Simulating neurons is simply not adequate; it is kind of silly science. Above all, most ANN are just a very particular type of “network,” as there are no connections within a particular layer. In contrast, Kohonen tried to grasp a more abstract level: the population of neurons. In our opinion this choice is much more feasible and much more powerful as well. In particular, a SOM can represent not only “neurons,” but any population of entities which can exchange and store information. More about that in a moment.
Nowadays, the methodology of SOM can be rated as well adopted. More than 8,000 research papers have been published so far, with increasing momentum, covering a lot of practical domains and research areas. Many have demonstrated the superiority or greater generality of SOM as compared to other methods.
The mechanism of a basic SOM is quite easy to describe, since there are only a few ingredients.
First, we need data. Imagine a table, where observations are listed in rows, and the column headers describe the variables that have been measured for each observed case. The variables are also called attributes, or features. Note that in the case of the basic (say, the Kohonen-) SOM the structure given by the attributes is the same for all records. Technically, the data have to be normalized per column such that the minimum value is 0 and the maximum value is 1. Note that this normalization ensures comparability of different sub-sets of observations. It represents just the most basic transformation of data, while many other transformations are possible: logarithmic re-scaling of the values of a column in order to shift the mode of the empirical distribution, splitting a variable by value, binarization, scaling of parameters that are available only on a nominal level, or combining two or several columns by a formula are further examples (for details please visit the chapter about modeling). In fact, the transformation of data (I am not talking here about the preparation of data!) is one of the most important ingredients for successful predictive modeling.
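The per-column normalization can be illustrated with a few lines of Python. This is only a minimal sketch, assuming the table is held as a NumPy array; the function name min_max_normalize and the handling of constant columns are my own choices for illustration, not part of the original SOM description.

```python
import numpy as np

def min_max_normalize(table):
    """Rescale every column of a 2-D table to the range [0, 1]."""
    table = np.asarray(table, dtype=float)
    col_min = np.nanmin(table, axis=0)   # nan-aware, so missing values are tolerated
    col_max = np.nanmax(table, axis=0)
    span = col_max - col_min
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (table - col_min) / span

# Three observations (rows), two attributes (columns).
records = np.array([[2.0, 100.0],
                    [4.0, 300.0],
                    [6.0, 200.0]])
normalized = min_max_normalize(records)
```

After this step both columns span the same range, although the raw values differed by two orders of magnitude, so neither attribute dominates the similarity measure merely because of its unit.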
Second, we create the SOM. Basically, and in its simplest form, the SOM is a grid, where each cell has 4 (squares, rectangles) or 6 edges (hexagonal layout). The grid consists of nodes and edges. Nodes serve as a kind of container, while edges work as a kind of fiber for spreading signals. In some versions of the SOM the nodes can range freely, or they can randomly move around a little bit.
An important element of the architecture of a SOM now is that each node gets assigned the same structure as we know it from the table. As a consequence, the vectors collected in the nodes can easily be compared by some function (just wait a second for that). In the beginning, each node gets randomly initialized. Then the data are fed into the SOM.
This data feeding is organized as follows. A randomly chosen record is taken from the table and then compared to all of the nodes. There is always a best matching node. The record then gets inserted into this node. Upon this insertion, which is kind of a hiding, the values in the node’s structure vector are recalculated, e.g. as the (new) mean of all values across all records collected in the node (container). The trick now is to change not just the winning node where the data record has been inserted, but also all nodes in its close surroundings, though with a strength that decreases with the distance.
This small activity of searching the best matching node, insertion and information spreading is done for all records, and possibly also repeated. The spreading of information to the neighbor nodes is a crucial element in the SOM mechanism. This spreading is responsible for the self-organization. It also represents a probabilistic coupling in a network. Of course, there are some important variants to that, see below, but basically that’s all. Ok, there is some numerical bookkeeping, optimizations to search the winning nodes etc. but these measures are not essential for the mechanism.
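For readers who prefer code to prose, the loop of searching the best matching node and spreading information to its neighbors might be sketched as follows. Note the assumptions: the sketch uses the common incremental Kohonen update (each node vector is pulled towards the record) rather than recalculating a mean over stored records, and the linear decay of learning rate and radius, the Gaussian neighborhood, and all names and default values (train_som, lr0, radius0) are illustrative choices of mine.

```python
import numpy as np

def best_matching_node(weights, record):
    """Return the (row, col) grid position of the node most similar to the record."""
    dist = np.linalg.norm(weights - record, axis=2)
    row, col = np.unravel_index(np.argmin(dist), dist.shape)
    return int(row), int(col)

def train_som(data, rows=5, cols=5, epochs=20, lr0=0.5, radius0=2.0, seed=0):
    """Train a rectangular SOM; returns node vectors of shape (rows, cols, n_attributes)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))      # random initialization
    # Grid coordinates of every node, shape (rows, cols, 2).
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):             # records in random order
            record = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                        # learning rate decays over time
            radius = max(radius0 * (1.0 - frac), 0.5)      # so does the neighborhood
            winner = best_matching_node(weights, record)
            # Influence decreases with the grid distance to the winning node.
            grid_dist2 = np.sum((grid - np.array(winner)) ** 2, axis=2)
            influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
            weights += lr * influence[..., None] * (record - weights)
            step += 1
    return weights

# Two well separated groups of records with three attributes each.
rng = np.random.default_rng(1)
cluster_a = rng.normal(0.2, 0.03, size=(30, 3)).clip(0, 1)
cluster_b = rng.normal(0.8, 0.03, size=(30, 3)).clip(0, 1)
data = np.vstack([cluster_a, cluster_b])
weights = train_som(data)
node_a = best_matching_node(weights, cluster_a[0])
node_b = best_matching_node(weights, cluster_b[0])
```

In this toy run, records from the two groups should end up in different regions of the grid. The line that updates all nodes weighted by their influence is the information spreading described above; without it, each node would learn in isolation and no ordering of the map would emerge.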
As a result one will find similar records in the same node, or in the direct neighbors. It has been shown that the SOM is topology preserving, that is, the SOM is smooth with regard to the similarity of neighbor nodes. The data records inside a node form a list, which is described by the node’s value vector. That value vector could be said to represent a class, or intension, which is defined by its empirical observations, the cases, or extension.
Once all data have been fed to the SOM, the training is finished. A SOM can also easily be run in a continuous mode, where the feed of incoming data never “stops.” Now the SOM can be used to classify new records. A new record simply needs to be compared to the nodes of the SOM, i.e. to the value vectors of the nodes, but NOT to all the cases (SOM is not case-based reasoning, but type-based reasoning!). If the records contained a marker attribute, e.g. indicating the quality of the record, you will also get the expected quality for a new record of unknown quality.
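The use of a marker attribute can be sketched as follows. This is a toy illustration under my own assumptions: the three node value vectors are written down by hand instead of being trained, the marker is a number between 0 and 1, and the helper names (node_marker_means, best_matching_node) are hypothetical.

```python
import numpy as np

def best_matching_node(node_vectors, record):
    """Index of the node whose value vector is closest to the record."""
    dist = np.linalg.norm(node_vectors - np.asarray(record, dtype=float), axis=1)
    return int(np.argmin(dist))

def node_marker_means(node_vectors, records, markers):
    """Mean marker value of the records collected by each node (NaN for empty nodes)."""
    sums = np.zeros(len(node_vectors))
    counts = np.zeros(len(node_vectors))
    for record, marker in zip(records, markers):
        k = best_matching_node(node_vectors, record)
        sums[k] += marker
        counts[k] += 1
    means = np.full(len(node_vectors), np.nan)
    filled = counts > 0
    means[filled] = sums[filled] / counts[filled]
    return means

# Hand-made stand-in for the value vectors of a trained SOM with three nodes.
node_vectors = np.array([[0.1, 0.1], [0.9, 0.9], [0.5, 0.5]])
records = np.array([[0.0, 0.2], [0.2, 0.1], [1.0, 0.8], [0.8, 0.9]])
markers = np.array([1.0, 1.0, 0.0, 0.0])        # e.g. 1 = good quality, 0 = bad quality

expected_quality = node_marker_means(node_vectors, records, markers)
new_record = [0.15, 0.05]                        # quality unknown
prediction = expected_quality[best_matching_node(node_vectors, new_record)]
```

The new record is compared only to the three node vectors, never to the four stored cases; this is the type-based reasoning mentioned above. The third node remains empty here, so it has a well-defined value vector but no expected quality.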
Properties of the SOM
The SOM belongs to the class of clustering algorithms. It is very robust against missing values in the table, and unlike many other methods it does NOT require any settings regarding the clusters, such as size or number. Of course, this is a great advantage and a property of logical consistency. Nodes may remain empty, while the node value vector of the empty node is well-defined. This is a very important feature of the SOM, as it represents the capability to infer potential, yet unseen observations. No other method is able to behave like this. Other properties can be invoked by means of possible extensions of the basic mechanism (see below).
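One common way to obtain this robustness against missing values is to compare a record to the nodes only on those attributes that have actually been observed. The following sketch assumes that missing values are encoded as NaN and that a root mean squared difference serves as the similarity function; both are illustrative assumptions, and masked_distance is a hypothetical name.

```python
import numpy as np

def masked_distance(node_vectors, record):
    """Distance of a record to every node, using only the observed attributes."""
    node_vectors = np.asarray(node_vectors, dtype=float)
    record = np.asarray(record, dtype=float)
    observed = ~np.isnan(record)
    if not observed.any():
        raise ValueError("record has no observed attributes")
    diff = node_vectors[:, observed] - record[observed]
    # Averaging (instead of summing) keeps records with different numbers
    # of observed attributes comparable.
    return np.sqrt(np.mean(diff ** 2, axis=1))

node_vectors = np.array([[0.1, 0.1, 0.1],
                         [0.9, 0.9, 0.9]])
record = [0.8, np.nan, 1.0]                  # the second attribute was not measured
distances = masked_distance(node_vectors, record)
winner = int(np.argmin(distances))
```

The incomplete record is still assigned to the second node, and the value vector of that node then offers a plausible estimate for the attribute that was not measured.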
As already said, nodes collect similar records of data, where a record represents a single observation. It is important to understand that a node is not the same as a cluster. In our opinion, it does not make much sense to draw boundaries around one or several nodes and thereby propose a particular “cluster.” Such a boundary should be set only with regard to an external purpose. Conversely, without such a purpose, it is meaningless to conceive of a trained SOM as a model. At best, it would represent a pre-specific model; yet, being able to create such a pre-specific model is a great property of the SOM.
The learning is competitive, since different nodes compete for a single record. Yet, it is also cooperative, since upon an insert operation information is exchanged between neighbor nodes.
The reasoning of the SOM is type-based, which is much more favorable than case-based reasoning. It is also more flexible than ANN, which just provide a classification without any distinction between extension and intension. SOM, but not ANN, can be used in two very different modes: first, just for clustering or grouping individual observations without any further assumptions, and second, for targeted modeling, that is, for establishing a predictive/diagnostic link between several (or many) basic input variables and one (or several) target variable(s) that represent the outcome of a process according to experience. Such a double usage of the same structure is barely accessible for any other associative structure.
Another difference is that ANN are much more suitable to approximate single analytic functions, while SOM are suitable for general classification tasks, where the parameter space and/or the value (outcome) space could even be discontinuous or folded.
A large advantage over many other methods is that the similarity function and the cost function are explicitly accessible. For ANN, SVM or statistical learning this is not possible. Similarly, the SOM automatically adapts its structure according to the data, i.e. it is also possible to change the method within the learning process, adaptively and self-organized.
As a result we can conclude that the design of the SOM method is much more transparent than that of any other of the advanced methods.
Competing Comparable Methods
SOM are either more robust, more general or simpler than any other method, while the quality of classification is closely comparable. Among those competing methods are artificial neural networks (ANN), principal component analysis (PCA), multi-dimensional scaling (MDS), and adaptive resonance theory networks (ART). Important ideas of ART networks can be merged with the SOM principle, keeping the benefits of both. PCA and MDS are based on statistical correlation analysis (covariance matrices), i.e. they import all the assumptions and necessary preconditions of statistics, namely the independence of observational variables. Yet, the goal is precisely to identify such dependencies, so it is not quite feasible to presuppose independence! SOM do not suffer from limitations due to such strange assumptions; moreover, it has recently been proven that SOM are a generalization of PCA.
Of course, there are many other methods, like Support Vector Machines (SVM) with statistical kernels, or tree forests; yet, these methods are purely statistical in nature, with no structural part in them. Moreover, they do not provide access to the similarity function, as is possible for the SOM.
A last word about the particular difference between ANN and SOM. SOM are true symmetrical networks, where each unit has its own explicit memory about observations, while the linkage to other units on the same level of integration is probabilistic. That means that the actual linkage between any two units can be changed dynamically within the learning process. In fact, a SOM is thus not a single network like a fishing net; it is much more appropriate to conceive of it as a representation of a manifold of networks.
Contrary to those advanced structural properties, the so-called Artificial Neural Networks are explicit directional networks. Units represent individual neurons and do not have storage capacities. A unit does not know anything about things like observations. Conceptually, these units are thus on a much lower level than the units in a SOM. In ANN they cannot have “inner” structure. The same is true for the links between the units. Since they have to be programmed in an explicit manner (which is called “architecture”), the topology of the connections cannot be changed during learning at runtime of the program.
In ANN information flows through the units in a directed manner (as in case of natural neurons). It is there almost impossible to create an equally dense connectivity within a single layer of units as in SOM. As a consequence, ANN do not show the capability for self-organization.
Taken as a whole, ANN seem to be under the representationalist delusion. In order to achieve the same general effects and abstract phenomena as the SOM is able to, very large ANN would be necessary. Hence, pure ANN are not really a valid alternative for our endeavor. This does not rule out the possibility to use them as components within a SOM or between SOMs.
Variants and Architectures
Here are some extensions and improvements of the SOM.
Homogenized Extensional Diversity
The original version of the SOM tends to collect “bad” records, those not matching well anywhere else, into a single cluster, even if the records are not similar at all. In this case nodes can no longer be compared legitimately, since their internal variances are no longer comparable and the mean/variance on the level of the node no longer describes the internal variance on the level of the collected records. The cure for that misbehavior is rather simple. The cost function controlling the matching of a record to the most similar node needs to contain the variability within the set of records (the extension of the type represented by the node) collected by the node. In addition, merging and splitting of nodes as described for structural learning helps effectively. In the scientific literature, there is as yet no reference for this extension of the SOM.
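How such a cost function might look can be sketched in a few lines. The text above does not specify a formula, so the additive form below (distance plus a weighted internal spread), the weight alpha, and the function names are purely hypothetical choices for illustration.

```python
import numpy as np

def node_spread(members):
    """Mean per-attribute standard deviation of the records collected by a node."""
    if len(members) < 2:
        return 0.0
    return float(np.mean(np.std(np.asarray(members, dtype=float), axis=0)))

def matching_cost(record, node_vector, members, alpha=1.0):
    """Distance to the node vector, penalized by the node's internal variability."""
    record = np.asarray(record, dtype=float)
    node_vector = np.asarray(node_vector, dtype=float)
    distance = float(np.linalg.norm(record - node_vector))
    return distance + alpha * node_spread(members)

record = [0.5, 0.5]

# Node A: homogeneous extension; node B: a collection of dissimilar records.
# Both node vectors are the means of their members and equally far from the record.
members_a = [[0.39, 0.5], [0.41, 0.5]]
members_b = [[0.1, 0.9], [1.0, 0.2], [0.7, 0.4]]
cost_a = matching_cost(record, [0.4, 0.5], members_a)
cost_b = matching_cost(record, [0.6, 0.5], members_b)
```

Although both node vectors are equally distant from the record, the homogeneous node wins. A node that has already collected very dissimilar records thus becomes less attractive for further records, which counteracts the collection of “bad” records in a single place.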
One of the most basic extensions to the original mechanism is to allow for splitting and merging of nodes according to some internal divergence criterion. A SOM made from such nodes is able to adapt structurally to the input data, not just statistically. This feature is inspired by the so-called ART-networks . Similarly, merging and splitting of “nodes” of a SOM was proposed by , though not in the context of ART networks.
Since the SOM represents populations of neurons, it is easy and straightforward to think about nesting of SOMs. Each node would contain a smaller SOM. A node may even contain any other parametrized method, such as Artificial Neural Networks. The node value vector then would not exhibit the structure of the table, but instead would display the parameters of the enclosed algorithm. One example for this is the so-called mn-SOM .
Usually, data are not evenly distributed. Hence, some nodes grow much more than others. One way to cope with this situation is to let the SOM grow automatically. Many different variants of growth could be thought of, and some have already been implemented. Our experiments point into the direction of a “stem cell” analog.
Growing SOMs were first proposed by , while  provides some exploratory implementation. Concerning growing SOMs, it is very important to understand the concept (or phenomenon) of growth. We will discuss possible growth patterns and the consequences for possible and new growing SOMs elsewhere. Just for now we can say that any kind of SOM structure can grow and/or differentiate.
SOM gas, mobile nodes
The name already says it: the topology of the grid making up the SOM is not fixed. Nodes even may range around quite freely, as in the case of SOM Gas.
Starting from mobile nodes, we can think about a small set of properties of nodes which are not directly given by the data structure. These properties can be interpreted as chemicals creating something like a Gray-Scott Reaction-Diffusion-Model, i.e. a self-organizing fluid dynamics. The possible effects are (i) a differential opacity of the SOM for transmitted information, or (ii) the differentiation into fibers and networks, or (iii) the optimization of the topological structure as a standard part of the life cycle of the SOM. The mobility can be controlled internally by means of a “temperature,” or, expressed by a metaphor, the fixed SOM would melt partially. This may help to reorganize a SOM. In the scientific literature, there is as yet no reference for this extension of the SOM.
Evolutionary Embedded SOM with Meta-Modeling
SOM can be embedded into an evolutionary learning process that searches for the most appropriate selection of attributes. This can be extended even towards the construction of structural hypotheses about the data. While other methods could also be embedded in a similar manner, the results are drastically different, since most methods do not learn structurally. Coupling evolutionary processes with associative structures was proposed a long time ago by , albeit only in the context of optimization of ANN. While this is quite reasonable, we additionally propose to use evolution in a different manner and for different purposes (see the chapter about evolution).
 ART networks
 merging splitting of nodes
 The mn-SOM
 Growing SOM a
 Growing SOM b
 evolutionary optimization of artificial neural networks