Analogical Thinking, revisited. (II)

March 20, 2012

(II/II)

In this second part of the essay about a fresh perspective on analogical thinking—more precisely, on the models about it—we will try to bring together two concepts that at first sight represent quite different approaches: Copycat and SOM.

Why engage in such an endeavor? Firstly, we are quite convinced that FARG’s Copycat demonstrates an important and outstanding architecture. It provides a well-founded proposal about the way we humans apply ideas and abstract concepts to real situations. Secondly, however, it is also clear that Copycat suffers from a few serious flaws in its architecture, particularly the built-in idealism. This renders any adaptation to more realistic domains, or even to completely domain-independent conditions, very difficult, if not impossible, since this drawback also prohibits structural learning. So far, Copycat is just able to adapt some predefined internal parameters. In other words, the Copycat mechanism just adapts a predefined structure, though a quite abstract one, to a given empirical situation.

Well, basically there seem to be two different, “opposite” strategies for merging these approaches. Either we integrate the SOM into Copycat, or we try to transfer the relevant, yet to be identified, parts of Copycat to a SOM-based environment. Yet, at the end of the day we will see that—and how—the two alternatives converge.

In order to accomplish our goal of establishing a fruitful combination of SOM and Copycat we have to take three main steps. First, we briefly recapitulate the basic elements of Copycat and the proper instance of a SOM-based system. Second, we describe the extended SOM system in some detail, although there will be a dedicated chapter on it. Finally, we transfer, and presumably adapt, those elements of the Copycat approach that are missing in the SOM paradigm.

Crossing over

The particular power of (natural) evolutionary processes derives from the fact that they are based on symbols. “Adaptation” or “optimization” are not processes that merely change the numerical values of parameters in formulas. Quite the opposite: in adaptational processes that span across generations, parts of the DNA-based story are being rewritten, with potential consequences for the whole of the story. This effect of recombination in the symbolic space is particularly present in the so-called “crossing over” during the production of gamete cells in the context of sexual reproduction in eukaryotes. Crossing over is a “technique” to dramatically speed up the exploration of the space of potential changes. (In some way, this space is also greatly enlarged by symbolic recombination.)

What we will try here in our attempt to merge the two concepts of Copycat and SOM is exactly this: a symbolic recombination. The difference from its natural template is that in our case we do not transfer DNA snippets between homologous locations in chromosomes; we transfer whole “genes,” which are represented by elements.

Elementarizations I: C.o.p.y.c.a.t.

In part 1 we identified two top-level (non-atomic) elements of Copycat: restricted generalized evolution, and concrete instances of positional idealization.

Since the first element, covering evolutionary aspects such as randomness, population and a particular memory dynamics, is pretty clear, and a whole range of possible ways to implement it is available, any attempt at improving the Copycat approach has to target the static, strongly idealistic characteristics of the structure that the FARG calls the “Slipnet.” The Slipnet has to be enabled for structural changes and autonomous adaptation of its parameters. This could be accomplished in many ways, e.g. by representing the items in the Slipnet as primitive artificial genes. Yet, we will take a different road here, since the SOM paradigm already provides the means to achieve idealizations.

At this point we have to elementarize Copycat’s Slipnet in a way that renders it compatible with the SOM principles. Hofstadter emphasizes the following properties of the Slipnet and the items contained therein (pp.212).

  • (1) Conceptual depth allows for a dynamic and continuous scaling of “abstractness” and resistance against “slipping” to another concept;
  • (2) Nodes and links between nodes both represent active abstract properties;
  • (3) Nodes acquire, spread, and lose activation, which knows a switch-on threshold < 1;
  • (4) The length of links represents conceptual proximity or degree of association between the nodes.

As a whole, and viewed from the network perspective, the Slipnet behaves much like a spring system, or a network built from rubber bands, where the springs or the rubber bands are regulated in their strength. Note that our concept of SomFluid also exhibits the feature of local regulation of the bonds between nodes, a property that is not present in the idealized standard SOM paradigm.
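As a toy illustration of properties (2)–(4) and this spring-like behavior, consider the following minimal sketch of threshold-gated spreading activation over a Slipnet-like graph. The node names, link lengths and update constants are purely illustrative assumptions, not FARG’s actual implementation:

```python
# Minimal sketch: threshold-gated spreading activation over a
# Slipnet-like graph. All names and constants are illustrative.

nodes = {"successor": 0.0, "predecessor": 0.0, "opposite": 0.0, "sameness": 0.0}
# links: (a, b) -> length in [0, 1]; shorter length = conceptually closer
links = {("successor", "predecessor"): 0.3,
         ("successor", "opposite"): 0.8,
         ("predecessor", "opposite"): 0.8,
         ("sameness", "opposite"): 0.6}

THRESHOLD = 0.5   # switch-on threshold < 1, cf. property (3) above
DECAY = 0.9

def step(activate=None):
    if activate:
        nodes[activate] = 1.0
    spread = {n: 0.0 for n in nodes}
    for (a, b), length in links.items():
        w = 1.0 - length              # proximity regulates the flow locally
        if nodes[a] >= THRESHOLD:     # only fully switched-on nodes spread
            spread[b] += w * nodes[a]
        if nodes[b] >= THRESHOLD:
            spread[a] += w * nodes[b]
    for n in nodes:
        nodes[n] = min(1.0, nodes[n] * DECAY + 0.2 * spread[n])

step(activate="successor")
step()
print(nodes)  # "predecessor" gains more than "opposite": its link is shorter
```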

Yet, the most interesting properties in the list above are (1) and (2), while (3) and (4) are known in the classic SOM paradigm as well. The first item is great because it represents an elegant way of creating a measurability that goes far beyond the nominal scale. As a consequence, “abstractness” ceases to be a nominal all-or-none property, as it is in hierarchies of abstraction. Such hierarchies can now be recognized as mere projections or selections, both introducing a severe limitation of expressibility. The conceptual depth opens a new space.

The second item is also very interesting since it blurs the distinction between items and their relations to some extent. That distinction is also a consequence of relying too readily on the nominal scale of description. It introduces a certain moment of self-reference, though this is not fully developed in the Slipnet. Nevertheless, a result of this move is that concepts can’t be thought without their embedding into a neighborhood of other concepts. Hofstadter clearly introduces a non-positivistic and non-idealistic notion here, as it establishes a non-totalizing meta-concept of wholeness.

Yet, the blurring between “concepts” and “relations” could be and must be driven far beyond the level Hofstadter achieved, if the Slipnet is to become extensible. Namely, all the parts and processes of the Slipnet need to follow the paradigm of probabilization, since this offers the only way to evade the demons of cybernetic idealism and apriori control. Hofstadter himself relies much on probabilization concerning the other two architectural parts of Copycat. It’s beyond me why he didn’t apply it to the Slipnet too.

Taken together, we may derive (or: impose) the following important elements for an abstract description of the Slipnet.

  • (1) Smooth scaling of abstractness (“conceptual depth”);
  • (2) Items and links of a network of sub-conceptual abstract properties are instances of the same category of “abstract property”;
  • (3) Activation of abstract properties represents a non-linear flow of energy;
  • (4) The distance between abstract properties represents their conceptual proximity.

A note should be added regarding the last (fourth) point. In Copycat, this proximity is a static number. In Hofstadter’s framework, it does not express something like similarity, since the abstract properties are not conceived as compounds. That is, the abstract properties are themselves on the nominal level. And indeed, it might appear rather difficult to conceive of concepts such as “right of”, “left of”, or “group” as compounds. Yet, I think that it is well possible by referring to mathematical group theory, the theory of algebra, and the framework of mathematical categories. All of those may be subsumed under the same operationalization: symmetry operations. Of course, there are different ways to conceive of symmetries and to implement the respective operationalizations. We will discuss this issue in a forthcoming essay that is part of the series “The Formal and the Creative“.

The next step is now to distill the elements of the SOM paradigm in a way that enables a common differential for the SOM and for Copycat.

Elementarizations II: S.O.M.

The self-organizing map is a structure that associates comparable items—usually records of values that represent observations—according to their similarity. Hence, it makes two strong and important assumptions.

  • (1) The basic assumption of the SOM paradigm is that items can be rendered comparable;
  • (2) The items are conceived as tokens that are created by repeated measurement;

The first assumption means that the structure of the items can be described (i) apriori to their comparison and (ii) independently from the final result of the SOM process. Of course, this assumption is not unique to SOMs; any algorithmic approach to the treatment of data is committed to it. The particular status of the SOM is given by the fact—and in stark contrast to almost any other method for the treatment of data—that this is the only strong assumption. All other parameters can be handled in a dynamic manner. In other words, there is no particular zone of the internal parametrization of a SOM that would be inaccessible apriori.

Compare this with ANN or statistical methods, and you feel the difference… Usually, methods are rather opaque with respect to their internal parameters. For instance, the similarity functional is usually not accessible, which renders all these nice-looking, so-called analytic methods into some kind of subjective gambling. In PCA and its relatives, for instance, the similarity is buried in the covariance matrix, which in turn is defined only within the assumption of normality of correlations. Unless a rank correlation is used, this assumption is even extended to the data itself. In both cases it is impossible to introduce a different notion of similarity. It is also impossible, as a consequence, to investigate how the results proposed by the method depend on its structural properties and (opaque) assumptions. In contrast to such unfavorable epistemo-mythical practices, the particular transparency of the SOM paradigm allows for critical structural learning of the SOM instances. “Critical” here means that the influence of the method’s internal parameters on the results or conclusions can be investigated, changed, and accordingly adapted.

The second assumption is implied by the SOM’s purpose as a learning mechanism. It simply needs some observations as results of the same type of measurement. The number of observations (the number of repeats) has to exceed a certain lower threshold, which, depending on the data and the purpose, is at least 8; typically, however, (much) more than 100 observations of the same kind are needed. Any result will be within the space delimited by the assignates (properties), and thus any result is a possibility (if we take just the SOM itself).

The particular accomplishment of a SOM process is the transition from the extensional to the intensional description, i.e. the SOM may be used as a tool to perform the step from tokens to types.

From this we may derive the following elements of the SOM:1

  • (1) a multitude of items that can be described within a common structure, though not necessarily an identical one;
  • (2) a dense network where the links between nodes are probabilistic relations;
  • (3) a bottom-up mechanism which results in the transition from an extensional to an intensional level of description;

As a consequence of this structure, the SOM process avoids the necessity of comparing all items (N) to all other items (N-1). This property, together with the probabilistic neighborhoods, establishes the main difference to other clustering procedures.
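To put the bare mechanism in front of the reader, here is a minimal SOM training sketch under the two assumptions stated above (fixed-length vectors, Euclidean similarity). All sizes and learning parameters are illustrative choices, not a reference implementation:

```python
# Minimal SOM training sketch (illustrative parameters throughout).
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, radius0=3.0):
    dim = data.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.random((rows, cols, dim))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        radius = max(radius0 * (1 - epoch / epochs), 0.5)
        for x in data:
            # each item is compared only to the nodes, never to the
            # other items: no N*(N-1) comparisons are needed
            d = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # soft Gaussian neighborhood: a "probabilistic" relation
            # between nodes rather than a hard assignment
            g = np.linalg.norm(grid - np.array(bmu), axis=2)
            h = np.exp(-(g ** 2) / (2 * radius ** 2))
            weights += lr * h[..., None] * (x - weights)
    return weights

som = train_som(np.random.default_rng(1).random((500, 3)))
```

The node vectors that result from such a run are the intensional descriptions: each node summarizes the extension (the set of observations) that it attracted.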

It is quite important to understand that the SOM mechanism as such is not a modeling procedure. Several extensions have to be added and properly integrated, such as

  • – operationalization of the target into a target variable;
  • – validation by separate samples;
  • – feature selection, preferably by an instance of a generalized evolutionary process (though not by a genetic algorithm);
  • – detecting strong functional and/or non-linear coupling between variables;
  • – description of the dependency of the results on internal parameters by means of data experiments.

We already described the generalized architecture of modeling as well as the elements of the generalized model in previous chapters.

Yet, as we explained in part 1 of this essay, analogy making is conceptually incompatible with any kind of modeling, as long as the target of the model points to some external entity. Thus, we have to choose a non-modeling instance of a SOM as the starting point. However, clustering is also an instance of those processes that provide the transition from extensions to intensions, whether this clustering is embedded into full modeling or not. In other words, neither the classic SOM nor the modeling SOM is a suitable candidate for a merger with Copycat.

SOM-based Abstraction

Fortunately, there is already a proposal, and even a well-known one, that indeed may be taken as such a candidate: the two-layer SOM (TL-SOM), as it has been demonstrated as an essential part of the so-called WebSom [1,2].

Actually, the description as being “two-layered” is a very minimalistic, if not inappropriate, description of what is going on in the WebSom. We already discussed many aspects of its architecture here and here.

Concerning our interests here, the multi-layered arrangement itself is not a significant feature. Any system doing complicated things needs a functional compartmentalization; we have met a multi-part, multi-compartment and multi-layered structure in the case of Copycat too. Apart from that, the SOM mechanism itself remains perfectly identical across the layers.

The really interesting features of the approach realized in the TL-SOM are

  • – the preparation of the observations into probabilistic contexts;
  • – the utilization of the primary SOM as a measurement device (the actual trick).

The domain of application of the TL-SOM is the comparison and classification of texts. Texts belong to unstructured data, and the comparison of texts is exposed to the same problematics as the making of analogies: there is no apriori structure that could serve as a basis for modeling. Also, like the analogies investigated by the FARG, a text is a locational phenomenon, i.e. it takes place in a space.

Let us briefly recapitulate the dynamics in a TL-SOM. In order to create a TL-SOM, the text is first dissolved into overlapping, probabilistic contexts. Note that the locational arrangement is captured by these random contexts; no explicit apriori rules are necessary to separate patterns. The resulting collection of contexts then gets “somified”: each node ends up containing similar random contexts that have been derived from various positions in different texts. Now the decisive step is taken, which consists in turning the perspective by “90 degrees”: we can use the SOM as the basis for creating a histogram for each of the texts. The nodes are interpreted as properties of the texts, i.e. each node represents a bin of the histogram. The value of an individual bin measures how frequently the text invokes the random contexts represented by that node. The secondary SOM then creates a clustering across these histograms, which represent the texts in an abstract manner.

This way the primary lattice of the TL-SOM is used to impose a structure on the unstructured entity “text.”
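A minimal sketch of this turn of perspective might look as follows; the helper names are hypothetical, and a trained primary lattice plus the per-text collections of context vectors are assumed as given:

```python
# Sketch of the TL-SOM's "90 degree" turn: re-describe each text as a
# histogram (density) over the nodes of the trained primary SOM.
import numpy as np

def bmu_index(weights, x):
    """Index of the best-matching node, with the lattice flattened."""
    d = np.linalg.norm(weights.reshape(-1, weights.shape[-1]) - x, axis=1)
    return int(np.argmin(d))

def text_histogram(primary_weights, context_vectors):
    """Each node acts as a bin; the histogram counts how often the
    text's random contexts fall into that bin."""
    n_bins = primary_weights.shape[0] * primary_weights.shape[1]
    hist = np.zeros(n_bins)
    for v in context_vectors:
        hist[bmu_index(primary_weights, v)] += 1
    return hist / max(hist.sum(), 1)  # normalized density across the lattice

# The collection of histograms (one per text) is then fed as input to
# the secondary SOM, which clusters the texts on this abstract level.
```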

Figure 1: A schematic representation of a two-layered SOM with built-in self-referential abstraction. The input for the secondary SOM (foreground) is derived as a collection of histograms that are defined as a density across the nodes of the primary SOM (background). The input for the primary SOM are random contexts.

To put it clearly: the secondary SOM builds an intensional description of entities that results from the interaction of a SOM with a probabilistic description of the empirical observations. Quite obviously, intensions built this way about intensions are not only quite abstract; the mechanism could even be stacked. It could be called “high-level perception” with as much justification as when Hofstadter uses the term for Copycat. The TL-SOM turns representational intensions into abstract, structural ones.

The two aspects from above thus interact; they are elements of the TL-SOM. Despite the fact that there are still transitions from extensions to intensions, we also can see that the targeted units of the analysis, the texts, get probabilistically distributed across an area, the lattice of the primary SOM. Since the SOM maps the high-dimensional input data onto its lattice in a way that preserves their topological properties, it is easy to recognize that the TL-SOM creates conceptual halos as an intermediate.

So let us summarize the possibilities provided by the SOM.

  • (1) SOMs are able to create non-empiric, or better: de-empirified idealizations of intensions that are based on “quasi-empiric” input data;
  • (2) TL-SOMs can be used to create conceptual halos.

In the next section we will focus on this spatial, or better: primarily spatial, effect.

The Extended SOM

Kohonen and co-workers [1,2] proposed to build histograms that reflect the probability density of a text across the SOM. Those histograms represent the original units (e.g. texts) in a quite static manner, using a kind of summary statistics.

Yet, texts are definitely not a static phenomenon. At first sight there is at least a series, while, described more appropriately, texts are even dynamic networks with their own associative power [3]. Returning to the SOM, we see that, in addition to the densities scattered across the nodes of the SOM, we also can observe a sequence of invoked nodes, following the sequence of random contexts in the text (or the serial observations).

The not so difficult question then is: how to deal with that sequence? Obviously, it is best conceived as a random process (though one with a strong structure), and random processes are best described using Markov models, either as hidden (HMM) or as transitional models. Note that the Markov model is not a model of the raw observational data; it describes the sequence of activation events of SOM nodes.

The Markov model can be used as a further means to produce conceptual halos in the sequence domain. The differential properties of a particular sequence as compared to the Markov model then could be used as further properties to describe the observational sequence.
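A minimal sketch of such a transitional model, assuming the sequence of best-matching nodes is already available, could look like this; the names and the smoothing constant are illustrative:

```python
# Sketch of a transitional Markov model over the sequence of activated
# SOM nodes (not over the raw data). Illustrative names and constants.
import numpy as np

def transition_model(node_sequence, n_nodes, smoothing=1e-3):
    """Row-normalized transition probabilities between node activations."""
    counts = np.full((n_nodes, n_nodes), smoothing)
    for a, b in zip(node_sequence[:-1], node_sequence[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sequence_surprise(model, node_sequence):
    """Negative log-likelihood of a sequence under the model: the
    differential property mentioned above, i.e. how much a particular
    sequence deviates from the learned transition structure."""
    ll = sum(np.log(model[a, b])
             for a, b in zip(node_sequence[:-1], node_sequence[1:]))
    return -ll

P = transition_model([0, 1, 2, 1, 0, 1, 2], n_nodes=3)
print(sequence_surprise(P, [0, 1, 2]))
```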

(The full version of the extended SOM comprises targeted modeling as a further level. Yet, this targeted modeling does not refer to raw data. Instead, its input is provided completely by the primary SOM, which is based on probabilistic contexts, while the target of such modeling is just internal consistency of a context-dependent degree.)

The Transfer

Just to avoid misunderstanding: it does not make sense to try to represent Copycat completely by a SOM-based system. The particular dynamics and phenomenological behavior depend a lot on Copycat’s tripartite morphology as represented by the Coderack (agents), the Workspace and the Slipnet. We are “just” in search of a possibility to remove the deep idealism from the Slipnet in order to enable it for structural learning.

Basically, there are two possible routes. Either we re-interpret the extended SOM in a way that allows us to represent the elements of the Slipnet as properties of the SOM, or we try to replace all the items in the Slipnet by SOM lattices.

So, let us take a look at which structures we have (Copycat) or could have (SOM) on both sides.

Table 1: Comparing elements from Copycat’s Slipnet to the (possible) mechanisms in a SOM-based system.

| Aspect | Copycat | extended SOM |
|---|---|---|
| 1. smoothly scaled abstraction | conceptual depth (dynamic parameter) | distance of abstract intensions in an integrated lattice of an n-layered SOM |
| 2. links as concepts | structure, by implementation | reflected as conceptual proximity, as an assignate property for a higher level |
| 3. activation featuring non-linear switching behavior | structure, by implementation | x |
| 4. conceptual proximity | link length (dynamic parameter) | distance in map (dynamic parameter) |
| 5. kind of concepts | locational, positional | symmetries, any |

From this comparison it is clear that the single most challenging part of this route is enabling the emergence of abstract intensions in the SOM based on empirical data. From the perspective of the SOM, relations between observational items such as “left-most,” “group” or “right of”, and even such as “sameness group” or “predecessor group”, are just probabilities of a pattern. Such patterns are identified by functions or dynamic combinations thereof; combinations of topological primitives remain mappable by analytic functions. Such concepts we could call “primitive concepts”, and we can map them to the process of data transformation and the set of assignates as potential properties.2 It is then the job of the SOM to assign a relevancy to the assignates.
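To illustrate what such “primitive concepts” as functions could look like in the letter domain, here is a hypothetical sketch; the predicate names and the feature layout are assumptions for illustration, not Copycat’s code:

```python
# Sketch: "primitive concepts" expressed as plain functions over an
# observation (here: a letter string). Their outputs become assignates,
# i.e. candidate properties whose relevancy a SOM would then weight.

def leftmost(s, i):            # positional primitive
    return float(i == 0)

def rightmost(s, i):
    return float(i == len(s) - 1)

def in_sameness_group(s, i):   # e.g. the 'b' letters in "abbbc"
    return float((i > 0 and s[i] == s[i - 1]) or
                 (i < len(s) - 1 and s[i] == s[i + 1]))

def successor_of_left(s, i):   # "right of", in the alphabetic sense
    return float(i > 0 and ord(s[i]) == ord(s[i - 1]) + 1)

primitives = [leftmost, rightmost, in_sameness_group, successor_of_left]

def assignates(s):
    """One feature vector per letter position; rows of such vectors are
    what a SOM could learn patterns (probabilities) over."""
    return [[p(s, i) for p in primitives] for i in range(len(s))]

print(assignates("abbbc"))
```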

Yet, Copycat’s Slipnet also comprises rather abstract concepts such as “opposite”. Furthermore, the most abstract concepts often act as links between more primitive concepts, or, in Hofstadter’s terms, conceptual items of lower “conceptual depth”.

My feeling here is that it is a fundamental mistake to implement concepts like “opposite” directly. What is opposite of something else is a deeply semantic concept in itself, and thus strongly dependent on the domain. I think that most of the interesting concepts, i.e. the most abstract ones, are domain-specific. Concepts like “opposite” could be considered as something “simple” only in the case of geometric or spatial domains.

Yet, that’s not a weakness. We should use it as a design feature. Take the following rather simple case, shown in the next figure, as an example. Here we simply mapped triplets of uniformly distributed random values onto a SOM. The three values can readily be interpreted as the parts of an RGB value, which renders the interpretation more intuitive. The special thing here is that the map has been a really large one: we defined approximately 700,000 nodes and fed approx. 6 million observations into it.

Figure 2: A SOM-based color map showing emergence of abstract features. Note that the topology of the map is a borderless toroid: Left and right borders touch each other (distance=0), and the same applies to the upper and lower borders.

We can observe several interesting things. The SOM didn’t come up with just any arbitrary sorting of the colors. Instead, a very particular one emerged.

First, the map is not perfectly homogeneous anymore. Very large maps tend to develop “anisotropies”, symmetry breaks if you like, simply due to the fact that the signal horizon becomes an important issue. This should not be regarded as a deficiency though; symmetry breaks are essential for the possibility of the emergence of symbols. Second, we can see that two “color models” emerged: the RGB model around the dark spot in the lower left, and the YMC model around the bright spot in the upper right. Third, the distance between the bright, almost white spot and the dark, almost black one is maximized.

In other words, and not quite surprisingly, conceptual distance is reflected as geometrical distance in the SOM. As in the case of the TL-SOM, we now could use the SOM as a measurement device that transforms an unknown structure into an internal property, simply by using the location in the SOM as an assignate for a secondary SOM. In this way we not only can represent “opposite”, we even have a model procedure for “generalized oppositeness” at our disposal.
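A minimal sketch of such a measurement device, under the toroid topology mentioned in Figure 2, might look like this; the lattice sizes and the scaling are illustrative assumptions:

```python
# Sketch: using a trained SOM as a measurement device. The position of
# an item's best-matching node becomes an assignate; on a toroidal
# lattice the maximal distance operationalizes "generalized oppositeness".
import numpy as np

def toroidal_distance(p, q, rows, cols):
    dr = min(abs(p[0] - q[0]), rows - abs(p[0] - q[0]))  # borders touch
    dc = min(abs(p[1] - q[1]), cols - abs(p[1] - q[1]))
    return np.hypot(dr, dc)

def oppositeness(weights, x, y):
    """Scaled to [0, 1]: 1.0 means the two items sit at maximally
    distant positions of the toroid, like black vs. white above."""
    rows, cols = weights.shape[:2]
    bmus = []
    for v in (x, y):
        d = np.linalg.norm(weights - v, axis=2)
        bmus.append(np.unravel_index(np.argmin(d), d.shape))
    dmax = np.hypot(rows / 2, cols / 2)
    return toroidal_distance(*bmus, rows, cols) / dmax

rng = np.random.default_rng(0)
toy_som = rng.random((20, 30, 3))   # untrained toy lattice, just to show the call
print(oppositeness(toy_som, np.array([0., 0., 0.]), np.array([1., 1., 1.])))
```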

It is crucial to understand this step of “observing the SOM”, thereby conceiving the SOM as a filter, or more precisely as a measurement device. Of course, at this point it becomes clear that a large variety of such transposing and internal-virtual measurement devices may be thought of. Methodologically, this opens an orthogonal dimension in the representation of data, strongly resembling the concept of orthoregulation.

The map shown above even allows us to create completely different color models, for instance one around yellow and another one around magenta. Our color psychology is strongly determined by the sun’s radiated spectrum, and hence it reflects a particular Lebenswelt; yet, there is no necessity about it. Some insects like bees are able to perceive ultraviolet radiation, i.e. their colors may have 4 components, yielding a completely different color psychology, while the capability to distinguish colors remains perfectly intact.3

“Oppositeness” is just a “simple” example of an abstract concept and its operationalization using a SOM. We already mentioned the “serial” coherence of texts (and thus of general arguments), which can be operationalized as a sort of virtual movement across a SOM of a particular level of integration.

It is crucial to understand that there is no other model besides the SOM that combines the ability to learn from empirical data and the possibility for emergent abstraction.

There is yet another lesson that we can take home from the simple example above. Well, the example doesn’t remain that simple. High-level abstraction, i.e. items of considerable conceptual depth, requires rather short assignate vectors. In the process of learning qua abstraction it appears to be essential that the mass of possible assignates derived from, or imposed by, the measurement of raw data is reduced. On the one hand, empiric contexts from very different domains should be abstracted, i.e. quite literally “reduced”, into the same perspective. On the other hand, any given empiric context should be abstracted into (much) more than just one abstract perspective. The consequence is that we need a lot of SOMs, all “sufficiently” separated from each other. In other words, we need a dynamic population of self-organizing maps in order to represent the capability of abstraction in real life. “Dynamic population” here means that there are developmental mechanisms that result in a proliferation, almost a breeding, of new SOM instances in a seamless manner. Of course, the SOM instances themselves have to be able to grow and to differentiate, as we have described it here and here.

In a population of SOMs the conceptual depth of a concept may be represented by the effort needed to arrive at a particular abstract “intension.” This not only comprises the ordinary SOM lattices, but also processes like Markov models, simulations, idealizations qua SOMs, targeted modeling, transition into symbolic space, synchronous or potential activations of other SOM compartments etc. This effort may finally be represented as a “number.”

Conclusions

The structure of a multi-layered system of self-organizing maps, as it has been proposed by Kohonen and co-workers, is a powerful model to represent emerging abstraction in response to empiric impressions. The Copycat model demonstrates how abstraction could be brought back to the level of application in order to become able to make analogies and to deal with “first-time exposures”.

Here we tried to outline a potential path to bring these models together. We regard this combination, in the way we proposed it (or a quite similar one), as crucial for any advance in the field of machine-based episteme at large, but also for the rather confined area of machine learning. Attempts like that of Blank [4] appear to suffer seriously from categorical mis-attributions. Analogical thinking does not take place on the level of single neurons.

We didn’t discuss alternative models here (so far; a small extension is planned). The main reasons are, first, that it would be an almost endless job, and second, that Hofstadter already did it, and as a result of his investigation he dismissed all the alternative approaches (from authors like Gentner, Holyoak, Thagard). For an overview of recent models of creativity, analogical thinking, and problem solving, Runco [5] provides a good starting point. Of course, many authors point in roughly the same direction as we did here, but mostly the proposals are circular, not helpful because the problematic is just replaced by another one (e.g. the infamous and completely unusable “divergent thinking”), or can’t be implemented for other reasons. Holyoak and Thagard [6], for instance, claim that a “parallel satisfaction of the constraints of similarity, structure and purpose” is key in analogical thinking. Given our analysis, such statements are nothing but a great mess, mixing modeling, theory, vagueness and fluidity.

For instance, in cognitive psychology, and in the field of artificial intelligence as well, the hypothesis of Structural Mapping (STM) finds a lot of supporters [7]. Hofstadter discusses similar approaches in his book. The STM hypothesis is highly implausible and obviously a left-over of the symbolic approach to Artificial Intelligence, just transposed into more structural regions. The STM hypothesis not only has to be implemented as a whole, it also has to be implemented specifically for each domain. There is no emergence of that capability.

The combination of the extended SOM—interpreted as a dynamic population of growing SOM instances—with the Copycat mechanism indeed appears as a self-sustaining approach to proliferating abstraction and—quite significantly—back from it into application. It will be able to make analogies in any field already at its first encounter with it, even regarding itself, since both the extended SOM as well as the Copycat comprise several mechanisms that may count as precursors of high-level reflexivity.

After this proposal, little remains to be said on the technical level. One of the issues that remain to be discussed is the conditions for the possibility of binding internal processes to external references. Here our favorite candidate principle is multi-modality, that is, the joint and inextricable “processing” (in the sense of “getting affected”) of words, images and physical signals alike. In other words, I feel that we have come close to the fulfillment of the ariadnic question of this blog: “Where is the Limit?” …even in its multi-faceted aspects.

A lot of implementation work now has to be performed, eventually commented by some philosophical musings about “cognition”, or, more appropriately, the “epistemic condition.” I just would like to invite you to stay tuned for the software publications to come (hopefully in the near future).

Notes

1. see also the other chapters about the SOM, SOM-based modeling, and generalized modeling.

2. It is somehow interesting that in the brain of many animals we can find very small groups of neurons, if not even single neurons, that respond to primitive features such as verticality of lines, or the direction of the movement of objects in the visual field.

3. Ludwig Wittgenstein insisted all the time that we can’t know anything about the “inner” representation of “concepts.” It is thus free of any sense and meaning to claim knowledge about one’s own inner state, as well as about that of others. Wilhelm Vossenkuhl introduces and explains the Wittgensteinian “grammatical” solipsism carefully and in a very nice way [8]. The only thing we can know about inner states is that we use certain labels for them, and the only meaning of emotions is that we report them in certain ways. In other terms, the only thing that is important is the ability to distinguish one’s feelings. This, however, is easy to accomplish for SOM-based systems, as we have demonstrated here and elsewhere in this collection of essays.

4. Don’t miss Timo Honkela’s webpage, where one can find a lot of gems related to SOMs! The only puzzling issue about all the work done in Helsinki is that the people there constantly and pervasively misunderstand the SOM per se as a modeling tool. Despite their ingenuity they completely neglect the issues of data transformation, feature selection, validation and data experimentation, which all have to be integrated to achieve a model (see our discussion here); for a recent example see here, or the cited papers about the Websom project.

  • [1] Timo Honkela, Samuel Kaski, Krista Lagus, Teuvo Kohonen (1997). WEBSOM – Self-Organizing Maps of Document Collections. Neurocomputing, 21: 101-117.4
  • [2] Krista Lagus, Samuel Kaski, Teuvo Kohonen (2004). Mining Massive Document Collections by the WEBSOM Method. Information Sciences, 163(1-3): 135-156. DOI: 10.1016/j.ins.2003.03.017.
  • [3] Klaus Wassermann (2010). Nodes, Streams and Symbionts: Working with the Associativity of Virtual Textures. The 6th European Meeting of the Society for Literature, Science, and the Arts, Riga, 15-19 June 2010. Available online.
  • [4] Douglas S. Blank. Implicit Analogy-Making: A Connectionist Exploration. Indiana University Computer Science Department. Available online.
  • [5] Mark A. Runco (2007). Creativity: Research, Development, and Practice. Elsevier.
  • [6] Keith J. Holyoak, Paul Thagard (1995). Mental Leaps: Analogy in Creative Thought. MIT Press, Cambridge.
  • [7] John F. Sowa, Arun K. Majumdar (2003). Analogical Reasoning. In: A. Aldo, W. Lex, B. Ganter (eds.), Conceptual Structures for Knowledge Creation and Communication, Proc. Intl. Conf. on Conceptual Structures, Dresden, Germany, July 2003. LNAI 2746, Springer, New York. pp. 16-36. Available online.
  • [8] Wilhelm Vossenkuhl (2009). Solipsismus und Sprachkritik. Beiträge zu Wittgenstein. Parerga, Berlin.



Analogical Thinking, revisited.

March 19, 2012

What is the New York of California?

(I/II)

Or even, what is the New York of New York? Almost everybody will come up with the same answer, despite the fact that the question is ill-defined. Both the question and its answer can be described only after the final appearance of the answer. In other words, it is not possible, apriori to its completion, to provide any proposal about the relevance of those properties that aposteriori are easily tagged as relevant for the description of both the question and the answer. Both the question and the solution do not “exist”, in the way that is pretended by their form, before we have finished making sense of them. There is a wealth of philosophical issues around this phenomenon, which we all have to bypass here. Here we will focus just on the possibility of mechanisms that could be invoked in order to build a model that is capable of behaving phenomenologically “as if“.

The credit for rendering such questions and the associated problematics salient in the area of computer models of thinking belongs to Douglas Hofstadter and his “Fluid Analogies Research Group” (FARG). In his book “Fluid Concepts and Creative Analogies”, which we already mentioned here, he proposes a particular model of which he claims that it is a proper model for analogical thinking. In constructing this model, which took more than 10 years of research, he did not try to stick (to get stuck?) to the neuronal level. One can’t describe the performance of a tennis player at the molecular level, he says. Remarkably, he also keeps the so-called cognitive sciences and their laboratory wisdom at a distance. Instead, his starting point is everyday language, and presumably a good deal of introspection as well. He sees his model located at an intermediate level between the neurons and consciousness (quite a large field, though).

His overarching claim is as simple as it is distant from the mainstream of AI and cognitive science. (Note that Hofstadter does not formulate it as “analogical reasoning.”)

Thinking is largely equivalent to making analogies.

Hofstadter is not interested in producing just another model for analogy making. There are indeed quite a lot of such models, which he discusses in great detail. And he refutes them all; he argues that they are all ill-posed, since none of them starts with perception. Without exception they all assume that the “knowledge” is already in the computer, and based on this assumption some computer program is established. Of course, such approaches are nonsense, running into what people working in the field of AI / machine learning euphemistically call the “knowledge acquisition bottleneck”. Yet, knowledge is nothing that could be externalized and then acquired subsequently by some other party; it can’t be found “in” the world, and of course it can’t be separated as something that “exists” beside the processing mechanisms of the brain, making the whole thing “smart”. As already mentioned, such ideas are utter nonsense.

Hofstadter’s basic strategy is different. He proposes to create a software system that is capable of “concept slipping” as an emergent phenomenon, deeply based on perceptional mechanisms. He even coined the term “high-level perception.”

That is, the […] project is not about simulating analogy-making per se, but about simulating the very crux of human cognition: fluid concepts. (p.208)

This essay will investigate his model. We will find that despite its appeal it is nevertheless seriously unrealistic, even according to Hofstadter’s own standards. Yet, despite its particular weaknesses it also demonstrates very interesting mechanisms. After extracting the cornerstones of his model we will try to map his insights to the world of self-organizing maps. We also will discuss how to transfer the interesting parts of Hofstadter’s model. Hofstadter himself clearly stated the deficiencies of “connectionist models” of “learning”; yet, my impression is that he was not aware of self-organizing maps at that time. By “connectionism” he obviously referred to artificial neural networks (ANN), and for those we completely agree with his critique.

Before we start I would like to provide some original sources, that is, copies of those parts that are most relevant for this essay. These parts are from chapter 5, chapter 7 and chapter 8 of the aforementioned book. There you will find many more details and lucid examples in Hofstadter’s own words.

Is there an Alternative to Analogies?

In order to find an alternative we have to take a small bird’s-eye view. Very coarsely spoken, thinking transforms some input into some output while being affected and transforming itself. In some sense, any transformation of input to output transforms the transforming instance, though to vastly different degrees. A trivial machine just wears off; a trivial computer—that is, any digital machine that fits into the scheme of Turing-computing1—can be reset to meet exactly a previous state. As soon as historical contingency is involved, reproducibility vanishes and strictly non-technical entities appear: memory, value, and semantics (among others).

This transformation game applies to analogy making, and it also applies to traditional modeling. Is it possible to apply any kind of modeling to the problematics that is represented by the “transfer game”, of which those little questions posed in the beginning are just an example?

In this context, Hofstadter calls the modeling approach the brute-force approach (p.327, chp.8). The outline of the modeling approach could look like this (p.337).

  • Step 1: Run down the apriori list of city-characterization criteria and characterize the “source town” A according to each of them.
  • Step 2: Retrieve an apriori list of “target towns” inside target region Y from the data base.
  • Step 3: For each retrieved target town X, run down the apriori list of city-characterization criteria again, calculating X’s numerical degree of match with A for every criterion in the list.
  • Step 4: For each target town X, sum up the points generated in Step 3, possibly using apriori weights, thus allowing some criteria to be counted more heavily than others.
  • Step 5: Locate the target town with the highest overall rating as calculated in Step 4, and propose it as “the A of Y”.
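To make the target of Hofstadter’s critique concrete, here is a deliberately naive sketch of Steps 1-5; the criteria, weights and the toy “database” are invented for illustration, and exhibit exactly the arbitrariness discussed below:

```python
# Deliberately naive sketch of the brute-force approach (Steps 1-5).
# Every name, criterion, weight and value here is an invented assumption.

criteria = {                    # the apriori city-characterization criteria
    "population_rank": lambda town: town["pop_rank"],
    "is_port":         lambda town: float(town["port"]),
    "cultural_weight": lambda town: town["culture"],
}
weights = {"population_rank": 0.5, "is_port": 0.2, "cultural_weight": 0.3}

def match_score(source, target):
    # Steps 3-4: per-criterion degree of match, then a weighted sum
    return sum(w * (1.0 - abs(criteria[c](source) - criteria[c](target)))
               for c, w in weights.items())

def the_A_of_Y(source, target_region):
    # Steps 2 and 5: scan the retrieved target towns, pick the best
    return max(target_region, key=lambda t: match_score(source, t))

new_york = {"pop_rank": 1.0, "port": True, "culture": 0.9}
california = [
    {"name": "Los Angeles", "pop_rank": 0.9, "port": True,  "culture": 0.8},
    {"name": "Sacramento",  "pop_rank": 0.3, "port": False, "culture": 0.4},
]
print(the_A_of_Y(new_york, california)["name"])
```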

Any plausible apriori list of city-characterization criteria would be long, very long indeed. Effectively, it can’t be limited in advance, since any imposed limit would represent a model that would claim to be better suited to decide about the criteria than the model being built. We crash into an infinite regress, and not just in theory. What we experience here is Wittgenstein’s famous verdict that justifications have to come to an end. Rules are embedded in the form of life (“Lebensform”), and without knowing everything about a particular Lebensform, and taking into consideration anything comprised by such (impossible) knowledge, we can’t start to model at all.

He identifies four characteristic difficulties for the modeling approach with regard to his little “transfer game” that plays around with cities.

  • – Difficulty 1: It is psychologically unrealistic to explicitly consider all the towns one knows in a given region in order to come up with a reasonable answer.
  • – Difficulty 2: Comparison of a target town and a source town according to a specific city-characterization criterion is not a hard-edged mechanical task, but rather, can itself constitute an analogy problem as complex as the original top-level puzzle.
  • – Difficulty 3: There will always be source towns A whose “essence”—that is, set of most salient characteristics—is not captured by a given fixed list of city-characterization criteria.
  • – Difficulty 4: What constitutes a “town in region Y” is not apriori evident.

Hofstadter underpins his point with the following question (p.347).

What possible set of apriori criteria would allow a computer to reply, perfectly self-confidently, that the country of Monaco is “the Atlantic City of France”?

Of course, the “computer” should come up with the answer in a way that is not pre-programmed explicitly.

Obviously, the problematics of making analogies can’t be solved algorithmically. There is not only no such thing as a single “solution”; even the criteria to describe the problem are missing. Thus we can conclude that modeling, even in its non-algorithmic form, is not a viable alternative to analogy making.

The FARG Model

In the following, we investigate the model as proposed by Hofstadter and his group, mainly Melanie Mitchell. This is separated into the parts

  • – precis of the model,
  • – its elements,
  • – its extension as proposed by Hofstadter,
  • – the main problems of the model, and finally,
  • – the main superior aspects of the model as compared to connectionist models (from Hofstadter’s perspective, of course).
Precis of the Model

Hofstadter’s conclusion from the problems of the model-based approach, and thus also the starting point for his endeavor, is that the making of an analogy must appear as an emergent phenomenon. Analogy itself can’t be “defined” in terms of criteria, beyond sort of rather opaque statements about “similarity.” The point is that this similarity could be measured only aposteriori, so this concept does not help. The capability for making analogies can’t be programmed explicitly. It would not be a “making” of analogies anymore; it would just be a look-up of dead graphemes (not even symbols!) in a database.

He demonstrates his ideas by means of a small piece of software called “Copycat”. The name derives from the internal processes of the software, as making “almost identical copies” is an important ingredient of it. Yet, it also refers to the problem that appears if you say: “I am doing this, now do the same thing…”

Copycat has three major parts, which he labels as (i) the Slipnet, (ii) the Workspace, (iii) the Coderack.

The Coderack is a rack that serves as a launching site for a population of agents of various kinds. Agents die off and are created in various ways. They may be spawned by other agents, by the Coderack, or by any of the items in the Slipnet—as a top-down specialist bred just to engage in situations represented by that Slipnet item. Any freshly created agent will first be put into the Coderack, regardless of its originator or kind.

Any particular agent behaves as a specialist for recognizing a particular situation or for establishing a particular relation between parts of the input “data,” the initial observation. This recognition requires an apriori model, of course. Since these models are rather abstract as compared to the observational data, Hofstadter calls them “concepts.” After their set-up, agents are put into the Coderack, from where they start in random order, but also dependent on their “inner state,” which Hofstadter calls “pressure.”
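The launching behavior can be sketched as an urgency-weighted random draw; the following toy code is an assumption about one way to realize it, not Copycat’s actual code:

```python
# Minimal sketch of the Coderack's biased-random launching: agents wait
# in the rack and are picked with probability proportional to their
# urgency ("pressure"). All names here are illustrative.
import random

class Agent:
    def __init__(self, name, urgency, action):
        self.name, self.urgency, self.action = name, urgency, action

coderack = [
    Agent("bond-scout",  urgency=3.0, action=lambda: "look for a bond"),
    Agent("group-scout", urgency=1.0, action=lambda: "look for a group"),
]

def run_one():
    # weighted random choice: high pressure runs more often, but
    # low-pressure agents are never excluded completely
    agent = random.choices(coderack, weights=[a.urgency for a in coderack])[0]
    coderack.remove(agent)
    return agent.name, agent.action()

print(run_one())
```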

The Slipnet is a loose “network” of deep and/or abstract concepts. In case of Copycat these concepts comprise

a, b, c, … , z, letter, successor, predecessor, alphabetic-first, alphabetic-last, alphabetic position, left, right, direction, leftmost, rightmost, middle, string position, group, sameness group, successor group, predecessor group, group length, 1, 2, 3, sameness, and opposite.

In total there are more than 60 such concepts. These items are linked together, and the length of a link reflects the “distance” between concepts. This distance changes while Copycat is working on a particular task; the change is induced by the agents in response to their “success.” The Slipnet is not really a “network,” since it is neither a logistic network (it doesn’t transport anything) nor an associative network like a SOM. It is also not suitable to conceive of it as a kind of filter in the sense of a spider’s web or a fisherman’s net. It is thus more appropriate to consider it simply as a non-directed, dynamic graph, where discrete items are linked.

Finally, the third aspect is the Workspace. Hofstadter describes it as a “busy construction site” and likens it to the cytoplasm (p.216). In the Workspace, the agents establish bonds between the atomic items of the observation. As said, each agent knows nothing about the posed problem; it is just capable of performing on a mini-aspect of the task. The whole population of agents, however, builds something larger. It looks much like the activity of ants or termites, building some morphological structure in the hive, or producing a macroscopic dynamic effect as a hive population. The Workspace is the location of such intermediate structures of various degrees of stability, meaning that some agents also work to remove a particular structure.

So far we have described the morphology. The particular dynamics unfolding on this morphology is settled between competition and cooperation, with the result of a collective calming down of the activities. The decrease in activity is itself an emergent consequence of the many parallel processes inside Copycat.

A single run of Copycat yields one instance of the result. Yet, a single answer is not the result itself. Rather, as different runs of Copycat yield different singular answers, the result consists of a probability density over the different singular answers. For the letter domain in which Copycat is working, the results look like this:

Figure 1: Probability densities as result of a Copycat run.

The Elements of the FARG Model

Before we proceed, I should emphasize that “element” is used here as we have introduced the term here.

Returning to the FARG model, it is important to understand that a particularly constrained randomness plays a crucial role in its setup. The population of agents does not search through all possibilities all the time. Rather, any existing intermediate result, say a structural hypothesis, serves as a constraint for the future search.

We also find different kinds of memories with different durations, and we find dynamic historic constraints, which we also could call contingencies. We have a population of different kinds of agents that cooperate and compete. In some almost obvious way, Copycat’s mechanisms may be conceived as an instance of the generalized evolution that we proposed earlier. Hofstadter himself is not aware that he just proposed a mechanism for generalized evolutionary change. He calls the process “parallel terraced scan”, thereby unnecessarily sticking to a functional perspective. Yet, we consider generalized evolution as one of the elements of Copycat. It could really be promising to develop Copycat as an alternative to so-called genetic algorithms.2

Despite a certain resemblance to natural evolution, the mechanisms built into Copycat do not comprise an equivalent to what is known from biology as “gene doubling”. Gene doubling and its counterpart, gene deletion, are probably the most important mechanisms in natural evolution. Copycat produces different kinds of agents, but the informational setup of these agents does not change, as it is given by the Slipnet. The equivalent of gene doubling would have to be implemented in the Slipnet. On the other hand, however, it is clear that the items in the Slipnet are too concrete, almost representational. In contrast, genes usually do not represent a particular function on the macro-level (which is one of the main structural faults of so-called genetic algorithms). So, we conclude that Copycat contains a restricted version of generalized evolution. Additionally, we see a structural resemblance to the theories of Edelman and his neuronal Darwinism, which actually is a nice insight.

Conceiving large parts of the mechanism of Copycat as (restricted) generalized evolution covers both the Coderack as well as the Workspace, but not the Slipnet.

The Slipnet acts as sort of a “Platonic Heaven” (Hofstadter’s term). It contains various kinds of abstract terms, where “abstract” simply means “not directly observable.” It is hence not comparable to those abstractions that can be used to build tree-like hierarchies; think of the series “fluffy”-dog-mammal-animal-living entity. Significantly, the abstract terms in Copycat’s Slipnet also comprise concepts about relations, such as “right,” “direction,” “group,” or “leftmost.” Relations, however, are nothing else than even more abstract symmetries, that is, transformational models, which may even form a mathematical group. Quite naturally, we could consider the items in the Slipnet as a mathematical category (of categories). Again, Hofstadter and Mitchell do not refer in any way to such structures, quite unfortunately so.

The Slipnet’s items may well be conceived as instances of symmetry relations. Hofstadter treats them as idealizations of positional relations. Any of these items acts as a structural property. This is a huge advance as compared to other models of analogy.

To summarize, we find two main elements in Copycat.

  • (1) restricted generalized evolution, and
  • (2) concrete instances of positional idealization.

Actually, these elements are top-level elements that must be conceived as compounds. In part 2 we will check out the elements of the Slipnet in detail, while the evolutionary aspects we already discussed in a previous chapter. Yet, this level of abstraction is necessary to render Copycat’s principles conceptually more mobile. In some way, we have to apply the principles of Copycat to the attempt to understand it.

The Copycat, released to the wild

Any generalization of Copycat has to remove the implicit constraints of its elements. In more detail, this would include the following changes:

  • (1) The representation of the items in the Slipnet could be changed into compounds, and these compounds should be expressed as “gene-like” entities.
  • (2) Introducing a mechanism to extend the Slipnet. This could be achieved through gene doubling in response to external pressures; yet, these pressures are not to be conceived as “external” to the whole system, just external to the Copycat. The pressures could be issued by a SOM. Alternatively, a SOM environment might also deliver the idealizations themselves. In either case, the resulting behavior of the Copycat has to be shaped by selection, either through internal mechanisms, or through environmentally induced forces (changes in the fitness landscape).
  • (3) The focus on positional idealization would have to be removed by introducing the more abstract notion of “symmetries”, i.e. mathematical groups or categories. This would render positional idealization into just one possible instance of potential idealization (see the sketch after this list).
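As a minimal sketch for point (3): “opposite” can be operationalized as a symmetry operation, e.g. an involution, which can be checked generically rather than implemented per domain. The domain instances below are illustrative assumptions:

```python
# Sketch: "opposite" operationalized as a symmetry operation, here an
# involution (applying it twice yields the identity). The two domain
# instances are invented for illustration.

opposites = {
    "letters":   lambda c: chr(ord('a') + ord('z') - ord(c)),  # a<->z, b<->y, ...
    "direction": {"left": "right", "right": "left"}.get,
}

def is_involution(op, domain):
    """Generic check: op(op(x)) == x for all x. The abstract concept is
    the symmetry property itself, not any particular instance of it."""
    return all(op(op(x)) == x for x in domain)

print(is_involution(opposites["letters"], "abcdefghijklmnopqrstuvwxyz"))
print(is_involution(opposites["direction"], ["left", "right"]))
```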

The resulting improvement from these changes would be dramatic. Not only would it be much easier to establish a Slipnet for any kind of domain, it also would allow the system (a CopyTiger?) to evolve new traits and capabilities, and to parametrize them autonomously. But these changes also require a change in the architectural (and mental) setup.

From Copycat to Metacat

Hofstadter himself tried to describe possible improvements of Copycat. A significant part of these suggestions concerns the capability for self-monitoring and proliferating abstraction, hence he calls the improved version “Metacat”.

The list of improvements comprises mainly the following five points (pp.315, chp.7).

  • (1) Self-monitoring of pressures, actions, and crucial changes as an explicit registering into parts of the Workspace.
  • (2) Disassembling of a given solution into the path of required actions.
  • (3) Hofstadter writes that “Metacat should store a trace of its solution of a problem in an episodic memory.”
  • (4) A clear “meta-analogical” sense as an ability to see analogies between analogies, that is a multi-leveled type of self-reflectiveness.
  • (5) The ability to create, and to enjoy the creation of, new puzzles. In this context he writes: “Indeed, I feel that responsiveness to beauty and its close cousin, simplicity, plays a central role in high-level cognition.”

I am not really convinced by these suggestions, at least not if they were implemented in the way that is suggested by Hofstadter “between the lines”. They look much more like a dream than a reasonable list of improvements, perhaps except for the first one. The topic of self-monitoring has been explored by James Marshall in his dissertation [1], but still his version of “Metacat” was not able to learn. Such self-monitoring should not be conceived as a kind of Cartesian theater [2], perhaps even populated with homunculi on both sides of the stage.

The second point is completely incompatible with the architecture of Copycat, and notably Hofstadter does not provide even the tiniest comment on it. The third point violates the concept of “memory” as a re-constructive device. Hofstadter himself says elsewhere, while discussing alternative models of analogy, that the brain is not a database, which is quite correct. “Memory” is not a storage device. Yet, the consequence is that analogy making can’t be separated from memory itself (and vice versa).

The fourth suggestion, then, would require further platonic heavens, in the case of Copycat/Metacat created by a programmer. This is highly implausible, and since it is a consequence of the architecture, the architecture of Copycat as such is not suitable to address real-world entities.

Finally, the fifth suggestion displays a certain naivety regarding evolutionary contexts, philosophical aspects of reasoning that have been known since Immanuel Kant, and the particular setup of human cognition, where emotions and propositional reasoning appear as deeply entangled issues.

The main Problem(s) of the FARG model

We already mentioned Copycat’s main problems, which are (i) the “Platonic heaven”, and (ii) the lack of the capability to learn as a kind of structural self-transformation.

Both problems are closely related. Actually, somehow there is only one single problem, and that is the issue that Hofstadter got trapped by idealism. A Platonic heaven that is filled by the designer with an x-cat (or a Copy-x) is hard to comprehend. Even for the really small letter domain there are more than 60 such idealistic, top-down, externally imposed concepts. These concepts have to be linked and balanced in just the right way, otherwise the Copycat will not behave interestingly in any way. Furthermore, the Slipnet is a structurally static entity. There are some parameters that change during its activity, but Copycat does not add new items to its Slipnet.

For these reasons it remains completely opaque how Mitchell and Hofstadter arrived at that particular instance of the Slipnet for the letter domain, and thus it also remains completely unclear how the “computer” itself could build or achieve something like a Slipnet. Albeit Linhares [3] was able to implement an analogous FARG model for the domain of chess3, his model suffers from the static Slipnet in the same way: it is extremely tedious to set up a Slipnet. Furthermore, the validation is even more laborious, if not impossible, due to the very nature of making analogies and the idealistic Slipnet.

The result is, well, a model that cannot serve as a template for any kind of application that is designed to be able to adapt and to learn, at least if we take it without abstracting from it.

From an architectural point of view the Slipnet is simply not compatible with the rest of Copycat, which is strongly based on randomness and probabilistic processes in populations. The architecture of the Slipnet and the way it is used do not offer something like a probabilistic pathway into it. But why should the “Slipnet” not be a probabilistic process as well?

Superior Aspects of the FARG model

Hofstadter clearly and correctly separates his project from connectionism (p.308):

Connectionist (neural-net) models are doing very interesting things these days, but they are not addressing questions at nearly as high a level of cognition as Copycat is, and it is my belief that ultimately, people will recognize that the neural level of description is a bit too low to capture the mechanisms of creative, fluid thinking. Trying to use connectionist language to describe creative thought strikes me as a bit like trying to describe the skill of a great tennis player in terms of molecular biology, which would be absurd.

A cornerstone in Hofstadter’s arguments and concepts around Copycat is conceptual slippage. This occurs in the Slipnet and is represented as a sudden change in the weights of the items such that the most active (or influential) “neighborhood” also changes. To describe these neighborhoods, he invokes the concept of a halo. The “halo” is a more or less circular region around one of the abstract items in the Slipnet, yet without a clear boundary. Items in the Slipnet change their relative position all the time, thus their co-excitation also changes dynamically.

Hofstadter lists (p.215) the following missing issues in connectionist network (CN) models with regard to cognition, particularly with regard to concept slippage and fluid analogies.

  • – CN don’t develop a halo around the representatives of concepts in the case of localist networks, i.e. node-oriented networks, and thus no slippability emerges;
  • – CN don’t develop a core region for a halo in the case of networks where a “concept” is distributed throughout the network, and thus no slippability emerges;
  • – CN have no notion of normality, due to the learning that is instantiated in any encounter with data.

This critique appears to be both a bit overdone and misdirected. As we have seen above, Copycat can be interpreted as comprising a slightly restricted case of generalized evolution. Standard neural techniques do not know of evolutionary techniques, there are no “coopetitioning” agents, and there is no separation into different memories of different durations. The abstraction achieved by artificial neural networks (ANN) or even by standard SOMs is always exhausted by the transition from extensional (observed items) to intensional descriptions (classes, types). The abstract items in the Slipnet are not just intensional descriptions and could not be found or constructed by an ANN or a SOM that works just on the observations, especially if there is just a single observation at all!

Copycat is definitely working in a different space compared to network-based models.1 While the latter can provide the mechanisms to proceed from extensions to intensions in a “bottom-up” movement, the former applies those intensions in a “top-down” manner. Saying this, we may invoke the reference to the higher forms of comparison and the Deleuzean differential. As many other things mentioned here, this would deserve a closer look from a philosophical perspective, which however we can’t provide here and now.

Nevertheless, Hofstadter’s critique of connectionist models seems to be closely related to the abandonment of modeling as a model for analogy making. Any of the three points above can be mitigated if we take a particular collection of SOM as a counterpart for Copycat. In the next section (which will be found in part II of this essay) we will see how the two approaches can inform each other.

Notes

1. We would like to point you to our discussion of non-Turing computation, and also make you aware of this conference: 11th International Conference on Unconventional Computation & Natural Computation 2012, University of Orléans, conference website.

2. Interestingly, Hofstadter’s PhD student, co-worker and co-author Melanie Mitchell started to publish in the field of genetic algorithms (GA); yet she never realized the kinship between GA and Copycat, or at least she never said anything like this publicly.

3. He calls his model implementation “Capyblanca”; it is available through Google Code.

4. The example provided by Blank [4], where he tried to implement analogy making in a simple ANN, is seriously deficient in many respects.

  • [1] James B. Marshall, Metacat: A Self-Watching Cognitive Architecture for Analogy-Making and High-Level Perception. PhD Thesis, Indiana University 1999. available online (last access 18/3/2012)
  • [2] Daniel Dennett, Consciousness Explained. 1992. p.107.
  • [3] Alexandre Linhares (2008). The emergence of choice: Decision-making and strategic thinking through analogies. available online.
  • [4] Douglas S. Blank, Implicit Analogy-Making: A Connectionist Exploration. Indiana University Computer Science Department. available online.

۞

Beyond Containing: Associative Storage and Memory

February 14, 2012 § Leave a comment

Memory, our memory, is a wonderful thing. Most of the time.

Yet, it also can trap you, sometimes terribly, if you use it in inappropriate ways.

Think about the problematics of being a witness. As long as you don’t try to remember exactly, you know precisely. As soon as you start to try to achieve perfect recall, everything starts to become fluid first, then fuzzy and increasingly blurry. As if there were some kind of uncertainty principle, similar to Heisenberg’s [1]. There are other tricks, such as asking a person the same question over and over again. Any degree of security, hence knowledge, will vanish. In the other direction, everybody knows the experience of a tiny little smell or sound triggering a whole story in memory, and often one that has not been cared about for a long time.

The main strengths of memory—extensibility, adaptivity, contextuality and flexibility—could also be considered its main weaknesses, if we expect perfect reproducibility for the results of “queries”. Yet, memory is not a database. There are neither symbols nor indexes, and at the deeper levels of its mechanisms, also no signs. There is no particular neuron that would “contain” information the way a file on a computer can be regarded to do.

Databases are, of course, extremely useful, precisely because they can’t do anything other than reproduce answers perfectly. That’s how they are designed and constructed. And precisely for the same reason we may state that databases are dead entities, like crystals.

The reproducibility provided by databases expels time. We can write something into a database, stop everything, and continue precisely at the same point. Databases do not own their own time. Hence, they are purely physical entities. As a consequence, databases do not/can not think. They can’t bring or put things together, they do not associate, superpose, or mix. Everything is under the control of an external entity. A database does not learn when the amount of bits stored inside it increases. We also have to be very clear about the fact that a database does not interpret anything. All this should not be understood as a criticism, of course, these properties are intended by design.

The first important consequence of this is that any system relying just on the principles of a database also will inherit these properties. This raises the question about the necessary and sufficient conditions for the foundations of “storage” devices that allow for learning and informational adaptivity.

As a first step one could argue that artificial systems capable of learning, for instance self-organizing maps, or any other “learning algorithm”, may consist of a database and a processor. This would represent the bare bones of the classic von Neumann architecture.

The essence of this architecture is, again, reproducibility as a design intention. The processor is basically empty. As long as the database is not part of a self-referential arrangement, there won’t be something like a morphological change.

Learning without change of structure is not learning but only changing the values of structural parameters that have been defined apriori (at implementation time). The crucial step, however, would be to introduce those parameters in the first place. We will return to this point at a later stage of our discussion, when it comes to describing the processing capabilities of self-organizing maps.1

Of course, the boundaries are not well defined here. We may implement a system in a very abstract manner such that a change in the value of such highly abstract parameters indeed involves deep structural changes. In the end, almost everything can be expressed by some parameters and their values. That’s nothing else than the principle of the Deleuzean differential.

What we want to emphasize here is just the issue that (1) morphological changes are necessary in order to establish learning, and (2) these changes should be established in response to the environment (and the information flowing from there into the system). These two conditions together establish a third one, namely that (3) a historical contingency is established that acts as a constraint on the further potential changes and responses of the system. The system acquires individuality. Individuality and learning are co-extensive. Quite obviously, such a system is not a von Neumann device any longer, even if it still runs on such a linear machine.

Our claim here is that the “learning” requires a particular perspective on the concept of “data” and its “storage.” And, correspondingly, without the changed concept about the relation between data and storage, the emergence of machine-based episteme will not be achievable.

Let us just contrast the two ends of our space.

  • (1) At the logical end we have the von Neumann architecture, characterized by empty processors, perfect reproducibility on an atomic level, the “bit”; there is no morphological change; only estimation of predefined parameters can be achieved.
  • (2) The opposite end is made from historically contingent structures for perception, transformation and association, where the morphology changes due to the interaction with the perceived information2; we will observe emergence of individuality; morphological structures are always just relative to the experienced influences; learning occurs and is structural learning.

With regard to a system that is able to learn, one possible conclusion from that would be to drop the distinction between the storage of encoded information and the treatment of those encodings. Perhaps it is the only viable conclusion to this end.

In the rest of this chapter we will demonstrate how the separation between data and their transformation can be overcome on the basis of self-organizing maps. Such a device we call “associative storage”. We also will find a particular relation between such an associative storage and modeling3. Notably, both tasks can be accomplished by self-organizing maps.

Prerequisites

When taking the perspective from the side of usage there is still another large contrasting difference between databases and associative storage (“memories”). In case of a database, the purpose of a storage event is known at the time of performing the storing operation. In case of memories and associative storage this purpose is not known, and often can’t be reasonably expected to be knowable by principle.

From that we can derive a quite important consequence. In order to build a memory, we have to avoid storing the items “as such,” as is the case for databases. We may call this the (naive) representational approach. Philosophically, the stored items do not have any structure inside the storage device, neither an inner structure nor an outer one. Any item appears as a primitive quale.

The contrast to the process in an associative storage is indeed a strong one. Here, it is simply forbidden to store items in an isolated manner, without relation to other items, as an engram, an encoded and reversibly decodable series of bits. Since a database works perfectly reversibly and reproducibly, we can encode the grapheme of a word into a series of bits and later decode that series back into a grapheme again, which in turn we as humans (with memory inside the skull) can interpret as words. Strictly taken, we do NOT use the database to store words.

More concretely, what we have to do with the items comprises two independent steps:

  • (1) Items have to be stored as context.
  • (2) Items have to be stored as probabilized items.

The second part of our re-organized approach to storage is a consequence of the impossibility to know about future uses of a stored item. Taken inversely, using a database for storage always and strictly implies that the storage agent claims to know perfectly about future uses. It is precisely this implication that renders long-lasting storage projects so problematic, if not impossible.

In other words, and even more concise, we may say that in order to build a dynamic and extensible memory we have to store items in a particular form.

Memory is built on the basis of a population of probabilistic contexts in and by an associative structure.

The Two-Layer SOM

In a highly interesting prototypical model project (codename “WEBSOM”) Kaski (a collaborator of Kohonen) introduced a particular SOM architecture that serves the requirements as described above [2]. Yet, Kohonen (and all of his colleagues alike) has so far not recognized the actual status of that architecture. We already mentioned this point in the chapter about some improvements of the SOM design; Kohonen fails to discern modeling from sorting when he uses the associative storage as a modeling device. Yet, modeling requires a purpose, operationalized into one or more target criteria. Hence, an associative storage device like the two-layer SOM can be conceived as a pre-specific model only.

Nevertheless, this SOM architecture is not only highly remarkable, but we also can easily extend it appropriately; thus it is indeed so important, at least as a starting point, that we describe it briefly here.

Context and Basic Idea

The context for which the two-layer SOM (TL-SOM) has been created is document retrieval by classification of texts. From the perspective of classification, texts are highly complex entities. This complexity of texts derives from the following properties:

  • – there are different levels of context;
  • – there are rich organizational constraints, e.g. grammars;
  • – there is a large corpus of words;
  • – there is a large number of relations that not only form a network, but which also change dynamically in the course of interpretation.

Taken together, these properties turn texts into ill-defined or even undefinable entities, for which it is not possible to provide a structural description, e.g. as a set of features, and particularly not in advance of the analysis. Briefly, texts are unstructured data. It is clear that especially non-contextual methods like the infamous n-grams are deeply inappropriate for the description, and hence also for the modeling, of texts. The peculiarity of texts has been recognized long before the age of computers. Around 1830 Friedrich Schleiermacher founded the discipline of hermeneutics as a response to the complexity of texts. In the last decades of the 20th century, it was Jacques Derrida who brought a new perspective to it. In Deleuzean terms, texts are always and inevitably deterritorialized to a significant portion. Kaski & coworkers addressed only a modest part of these vast problematics, the classification of texts.

The starting point they took was to preserve context. The large variety of contexts makes it impossible to take any kind of raw data directly as input for the SOM. That means that the contexts had to be encoded in a proper manner. The trick is to use a SOM for this encoding (details in the next section below). This SOM represents the first layer. The subject of this SOM are the contexts of words (definition below). The “state” of this first SOM is then used to create the input for the SOM on the second layer, which then addresses the texts. In this way, the size of the input vectors is standardized and reduced.

Elements of a Two-Layer SOM

The elements, or building blocks, of a TL-SOM devised for the classification of texts are

  • (1) random contexts,
  • (2) the map of categories (word classes)
  • (3) the map of texts

The Random Context

A random context encodes the context of any of the words in a text. Let us assume for the sake of simplicity that the context is bilaterally symmetric according to 2n+1; for example, with n=3 the length of the context is 7, where the focused word (“structure”) is at position 3 (when counting starts with 0).

Let us resort to the following example, which takes just two snippets from this text. The numbers represent some arbitrary enumeration of the relative positions of the words.

sequence A of words:     “… without change of structure is not learning …”
rel. positions in text:      53      54    55    56      57  58    59

sequence B of words:     “… not have any structure inside the storage …”
rel. positions in text:      19  20   21   22        23     24  25

We need the position numbers just for calculating the positional distance between words. The interesting word here is “structure”.

For the next step you have to think of the words as listed in a catalog of indexes, that is, as a set whose order is arbitrary but fixed. In this way, any of the words gets its unique numerical fingerprint.

Index  Word       Random Vector
 …     …
1264   structure  0.270  0.938  0.417  0.299  0.991 …
1265   learning   0.330  0.990  0.827  0.828  0.445 …
1266   Alabama    0.375  0.725  0.435  0.025  0.915 …
1267   without    0.422  0.072  0.282  0.157  0.155 …
1268   storage    0.237  0.345  0.023  0.777  0.569 …
1269   not        0.706  0.881  0.603  0.673  0.473 …
1270   change     0.170  0.247  0.734  0.383  0.905 …
1271   have       0.735  0.472  0.661  0.539  0.275 …
1272   inside     0.230  0.772  0.973  0.242  0.224 …
1273   any        0.509  0.445  0.531  0.216  0.105 …
1274   of         0.834  0.502  0.481  0.971  0.711 …
1275   is         0.935  0.967  0.549  0.572  0.001 …
 …

Any of the words of a text can now be replaced by an apriori determined vector of random values from [0..1]; the dimensionality of those random vectors should be around 80 in order to approximate orthogonality among all those vectors. Just to be clear: these random vectors are taken from a fixed codebook, a catalog as sketched above, where each word is assigned to exactly one such vector.
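As a minimal sketch, such a codebook could be set up as follows; all names are ours, chosen for illustration, and only the idea of one fixed random vector per word is taken from the description above:

```python
import numpy as np

# Minimal sketch of the fixed codebook: every word receives exactly one
# apriori determined random vector with values from [0..1].
rng = np.random.default_rng(42)

def build_codebook(vocabulary, dim=80):
    # dim ~ 80 suffices for approximate pairwise orthogonality
    return {word: rng.random(dim) for word in vocabulary}

codebook = build_codebook(["without", "change", "of", "structure", "is",
                           "not", "learning", "have", "any", "inside",
                           "the", "storage"])
```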

Once we have performed this replacement, we can calculate the averaged vectors per relative position of the context. In the case of the example above, we would calculate the reference vector for position n=0 as the average of the vectors encoding the words “without” and “not”.

Let us be more explicit. Sequence A we first translate into the positional numbers, interpret these positional numbers as column headers, and fill each column with the values of the respective fingerprint. For the 7 positions (-3 … +3) we get 7 columns:

sequence A of words:          “… without change of     structure is     not    learning …”
rel. positions in text:          53      54     55     56        57     58     59
grouped around “structure”:      -3      -2     -1      0         1      2      3

random fingerprints per position:
0.422  0.170  0.834  0.270  0.935  0.706  0.330
0.072  0.247  0.502  0.938  0.967  0.881  0.990
0.282  0.734  0.481  0.417  0.549  0.603  0.827
…further entries of the fingerprints…

We have to do the same for the second sequence B. Now we have two tables of fingerprints, both comprising 7 columns and N rows, where N is the length of the fingerprint. From these two tables we calculate the average values and put them into a new table (which is of course also of dimension 7×N). Thus, the example above yields 7 such averaged reference vectors. If we have a dimensionality of 80 for the random vectors, we end up with a matrix of [r,c] = [80,7].
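Continuing the sketch above, the averaging and concatenation steps might look like this; the function name and layout are our assumptions, and the codebook is the one built before:

```python
import numpy as np

# Sketch: average the positional fingerprints over all occurrences of the
# focus word, yielding the [80,7] matrix, then concatenate the 7 columns
# into one 560-vector.
def random_context(occurrences, codebook, dim=80):
    n_pos = len(occurrences[0])                  # here: 7 positions
    table = np.zeros((dim, n_pos))
    for words in occurrences:
        for pos, word in enumerate(words):
            table[:, pos] += codebook[word]
    table /= len(occurrences)                    # averaged reference vectors
    return table.T.flatten()                     # columns concatenated: 7*80=560

seq_a = ["without", "change", "of", "structure", "is", "not", "learning"]
seq_b = ["not", "have", "any", "structure", "inside", "the", "storage"]
ctx = random_context([seq_a, seq_b], codebook)   # shape (560,)
```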

In a final step we concatenate the columns into a single vector, yielding a vector of 7×80=560 variables. This might appear as a large vector. Yet, it is much smaller than the whole corpus of words in a text. Additionally, such vectors can be compressed by the technique of random projection (mathematical foundations by [3], first proposed for data analysis by [4], utilized for SOMs later by [5] and [6]), which today is quite popular in data analysis. Random projection works by matrix multiplication. Our vector (1R × 560C) gets multiplied with a matrix M(r) of 560R × 100C, yielding a vector of 1R × 100C. The matrix M(r) also consists of flat random values. This technique is very interesting because no relevant information is lost, but the vector gets shortened considerably. Of course, in an absolute sense there is a loss of information. Yet, the SOM only needs the information which is important to distinguish the observations.
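In code, continuing the example above, the projection step is a single matrix multiplication:

```python
# Sketch of the random projection: multiply the 560-vector with a flat
# random matrix of 560 rows and 100 columns, yielding a 100-vector.
M_r = rng.random((560, 100))      # fixed once, reused for all context vectors
ctx_compressed = ctx @ M_r        # (560,) @ (560, 100) -> (100,)
```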

This technique of transferring a sequence made from items encoded on a symbolic level into a vector that is based on random contexts can of course be applied to any symbolic sequence.

For instance, it would be a drastic case of reductionism to conceive of the path taken by humans in an urban environment just as a sequence of locations. Humans are symbolic beings and the urban environment is full of symbols to which we respond. Yet, for the population-oriented perspective any individual path is just a possible path. Naturally, we interpret it as a random path. The path taken through a city needs to be described both by location and by symbol.

The advantage of the SOM is that the random vectors that encode the symbolic aspect can be combined seamlessly with any other kind of information, e.g. the locational coordinates. That’s the property of multi-modality. Which particular combination of “properties” is then suitable to classify the paths for a given question is subject to “standard” extended modeling as described in the chapter Technical Aspects of Modeling.

The Map of Categories (Word Classes)

From these random context vectors we can now build a SOM. Similar contexts will arrange in adjacent regions.

A particular text now can be described by its differential abundance across that SOM. Remember that we have sent the random contexts of many texts (or text snippets) to the SOM. To achieve such a description, a (relative) frequency histogram is calculated, which has as many classes as the SOM has nodes. The values of the histogram are the relative frequencies (“probabilities”) for the presence of a particular text in comparison to all other texts.
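A possible reading of this step in code, assuming a trained first-layer SOM given as an array of node profiles (all names are ours):

```python
import numpy as np

# Sketch: describe a text by the relative frequencies of its random contexts
# across the nodes of the first-layer SOM (som_profiles: (n_nodes, dim)).
def text_fingerprint(context_vectors, som_profiles):
    hist = np.zeros(len(som_profiles))
    for v in context_vectors:
        bmu = np.argmin(np.linalg.norm(som_profiles - v, axis=1))
        hist[bmu] += 1.0                  # count the winning node
    return hist / hist.sum()              # relative frequencies ("probabilities")
```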

Any particular text is now described by a fingerprint that contains highly relevant information about

  • – the context of all words as a probability measure;
  • – the relative topological density of similar contextual embeddings;
  • – the particularity of texts across all contextual descriptions, again as a probability measure.

Those fingerprints represent texts and they are ready-mades for the final step, “learning” the classes by the SOM on the second layer in order to identify groups of “similar” texts.

It is clear, that this basic variant of a Two-Layer SOM procedure can be improved in multiple ways. Yet, the idea should be clear. Some of those improvements are

  • – to use a fully developed concept of context, e.g. this one, instead of a constant length context and a context without inner structure;
  • – evaluating not just the histogram as a foundation of the fingerprint of a text, but also the sequence of nodes according to the sequence of contexts; that sequence can be processed using a Markov-process method, such as HMM, Conditional Random Fields, or, in a self-similar approach, by applying the method of random contexts to the sequence of nodes;
  • – reflecting at least parts of the “syntactical” structure of the text, such as sentences, paragraphs, and sections, as well as the grammatical role of words;
  • – enriching the information about “words” by representing them not only in their observed form, but also as their close synonyms, or stuffed with the information about pointers to semantically related words as this can be taken from labeled corpuses.

We want to briefly return to the first layer. Just imagine not measuring the histogram, but instead following the indices of the contexts across the developed map with your fingertips. A particular path, or virtual movement, appears. I think that it is crucial to reflect this virtual movement in the input data for the second layer.

The reward could be significant, indeed. It offers nothing less than a model for conceptual slippage, a term which has been emphasized by Douglas Hofstadter throughout his research on analogical and creative thinking. Note that in our modified TL-SOM this capacity is not an “extra function” that had to be programmed. It is deeply built “into” the system, or in other words, it makes up its character. Besides Hofstadter’s proposal, which is based on a completely different approach and designed for a different task, we do not know of any other system that would be capable of that. We even may expect that the efficient production of metaphors can be achieved by it, which is not an insignificant goal, since all practiced language is always metaphoric.

Associative Storage

We already mentioned that the method of TL-SOM extracts important pieces of information about a text and represents it as a probabilistic measure. The SOM does not contain the whole piece of text as single entity, or a series of otherwise unconnected entities, the words. The SOM breaks the text up into overlapping pieces, or better, into overlapping probabilistic descriptions of such pieces.

It would be a serious misunderstanding to perceive this splitting into pieces as a drawback or failure. It is the mandatory prerequisite for building an associative storage.

Any further target oriented modeling would refer to the two layers of a TL-SOM, but never to the raw input text. Thus it can work reasonably fast for a whole range of different tasks. One of those tasks that can be solved by a combination of associative storage and true (targeted) modeling is to find an optimized model for a given text, or any text snippet, including the identification of the discriminating features. We also can turn the perspective around, addressing the query to the SOM about an alternative formulation in a given context…

From Associative Storage towards Memory

Despite its power and its potential as associative storage, the Two-Layer SOM still can’t be conceived as a memory device. The associative storage just takes the probabilistically described contexts and sorts them topologically into the map. In order to establish “memory”, further components are required that provide the goal orientation.

Within the world of self-organizing maps, simple (!) memories are easy to establish. We just have to combine a SOM that acts as associative storage with a SOM for targeted modeling. The peculiar distinctive feature of that second SOM for modeling is that it does not work on external data, but on “data” as it is available in and as the SOM that acts as associative storage.

We may establish a vivid memory in its full meaning if we establish three further components: (1) targeted modeling via the SOM principle, (2) a repository of the targeted models that have been built from (or using) the associative storage, and (3) at least a partial operationalization of a self-reflective mechanism, i.e. a modeling process that is going to model the working of the TL-SOM. Since in our framework the basic SOM module is able to grow and to differentiate, there is no principal limitation for such a system any more, concerning its capability to build concepts, models, and (logical) habits for navigating between them. Later, we will call the “space” where this navigation takes place “choreosteme”: drawing figures into the open space of epistemic conditionability.

From such a memory we may expect dramatic progress concerning the “intelligence” of machines. The only questionable thing is whether we should call such an entity still a machine. I guess, there is neither a word nor a concept for it.


Notes

1. Self-organizing maps have some amazing properties on the level of their interpretation, which they share especially with the Markov models. As such, the SOM and Markov models are outstanding. Both the SOM and the Markov model can be conceived as devices that can be used to turn programming statements, i.e. all the IF-THEN-ELSE statements occurring in a program, into DATA. Even logic itself, or more precisely, any quasi-logic, is getting transformed into data. SOM and Markov models are double-articulated (a Deleuzean notion) into logic on the one side and the empiric on the other.

In order to achieve this, full write access to the extensional as well as the intensional layer of a model is necessary. Hence, neither artificial neural networks nor, of course, statistical methods like PCA can be used to achieve the same effect.

2. It is quite important not to forget that (in our framework) information is nothing that “is out there.” If we follow the primacy of interpretation, for which there are good reasons, we also have to acknowledge that information is not a substantial entity that could be stored or processed. Information is nothing else than the actual characteristics of the process of interpretation. These characteristics can’t be detached from the underlying process, because this process is represented by the whole system.

3. Keep in mind that we only can talk about modeling in a reasonable manner if there is an operationalization of the purpose, i.e. if we perform target oriented modeling.

  • [1] Werner Heisenberg. Uncertainty Principle.
  • [2] Samuel Kaski, Timo Honkela, Krista Lagus, Teuvo Kohonen (1998). WEBSOM – Self-organizing maps of document collections. Neurocomputing 21 (1998) 101-117.
  • [3] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189–206. Amer. Math. Soc., 1984.
  • [4] R. Hecht-Nielsen. Context vectors: general purpose approximate meaning representations self-organized from raw data. In J.M. Zurada, R.J. Marks II, and C.J. Robinson, editors, Computational Intelligence: Imitating Life, pages 43–56. IEEE Press, 1994.
  • [5] Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (1998). Latent semantic indexing: A probabilistic analysis. Proceedings of the Seventeenth ACM Symposium on the Principles of Database Systems (pp. 159-168). ACM press.
  • [6] Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 245-250). ACM Press.

۞

The Self-Organizing Map: SOMe Design Issues

February 4, 2012 § 1 Comment

It is the duality of persistent, quasi-material yet simulated structures and the highly dynamic, volatile and, most salient, informational aspects that is so characteristic for learning entities like Self-Organizing Maps (SOM) or Artificial Neural Networks (ANN). It should not be regarded as a surprise that the design of the manifold aspects of the persistent, quasi-material part of SOM or ANN is quite influential and hence also important.

Here we explore some of the aspects of that design. Sure, there is something like a “classic” version of the SOM, named after its inventor, the so-called “Kohonen-SOM.” Kohonen developed several slightly different SOM mechanisms over many years, starting with statistical covariance matrices. All of them comprise great ideas, for sure. Yet, in a wider perspective it is clear that there are many properties of the SOM that are presumably quite sub-optimal for realizing a generally applicable learning mechanism.

The Elements of SOMs

We shall recapitulate the principle of the SOM very briefly below; more detailed descriptions can be found in many places on the Web (one of the best for the newbie, with some formulas and a demo software: ai-junkie), see also our document here that relates some issues to references, as well as our intro in plain language.

Yet, the question beyond all the mathematical formula stuff is: “What are the elements of a SOM?”

We propose to distinguish the following four basic elements:

  • (1) a Collection of Items
    that have a memory for observations, or reflect them, where all the items start with the same structure for these observations (items are often called “nodes”, or in a more romantic attitude, “neurons”);
  • (2) the Spatial Layout Principles
    and the relational arrangement of these items;
  • (3) an Influence Mechanism
    that links the items together, and which together with the spatial layout defines the topology of the piece;
  • (4) a Perceptional Mechanism
    that introduces observations into the SOM in a particular manner.

In the case of the SOM these elements are configured in a way that creates a particular class of “learning” that we can describe as competitive-collaborative abstraction.

Those basic elements of a SOM can be parameterized—and thus also implemented—in very different ways. If we took only the headlines of that list we could also subsume artificial neural networks (ANN) under these elements. Yet, even the items of a SOM and those of an ANN are drastically different. Also, the meanings of concepts like “layout” or “influence mechanism” are very different. This results in a completely different architecture regarding the relation of the “data”, or if you like potential observations, and the structure (SOM or ANN). Basically, ANNs are analytic, which means that the abstraction has to be done before the interaction of the structure with the data. In strong contrast to this approach, a SOM builds up an abstraction while interacting with the data. This abstraction mostly consists of the transition from extensional data to an intensional representation. Thus a SOM is able to find a structure, while an ANN can only move within the apriori defined structure. In contrast to ANN, SOM are associative mechanisms (which is the reason why we are so fond of them).

Yet, it is also true for SOMs that the parametrization of the instances of the four elements listed above has a significant influence on the capabilities and the potential of the resulting actual associative structure. Note that the design of the internals of the SOM does not refer to the issues of the usage or the embedding of the SOM into a wider context of modeling, or the structure of modeling itself.

In the following we will discuss the usual actualizations of those four elements, the respective drawbacks and better alternatives.

The SOM itself

Often one can find schematic representations like the one shown in the following figure 1:

Then this is usually described in this way: “The network is created from a 2D lattice of ‘nodes’, each of which is fully connected to the input layer.”

Albeit this is a possible description, it is a highly misleading one, with some quite unfavorable consequences: as we will see, it hides some important opportunities offered by the SOM mechanism.

Instead of speaking in an opaque manner about the “input layer” we simply can use the concept of “structured observations”. The structure is just given by the features used to establish or describe the observations. The important step that simplifies everything is to give all the nodes the same structure as the observations, at least in the beginning and as the trivial case; we will see that both assumptions may “develop away” as an effect of self-organization.

Anyway, the complicated connectivity in figure 1 changes into the following structure for the simple case:

Figure 2: An interpretation of the SOM grid, where the nodes are stuffed with the same structure (ordered set of variables) as the observations. This interpretation allows for a localizing of structures that is not achievable by the standard interpretation as shown in Fig.1.

To see what we gain by this change we have to visit briefly and partially the SOM mechanism.

The SOM mechanism compares a particular “incoming” observation to “all” nodes and determines a best matching node. The intensional part of this node then gets changed as a function of the given weight vector and the new observation. Some kind of intermediate between the observational vector and the intensional vector of the node is established. As a consequence, the nodes develop different intensional descriptions. This change upon matching with an observation then will be spread in the vicinity of the selected node, decaying with the distance, while the radius of this vicinity additionally shrinks with increasing duration of the learning process. This is called the lateral control mechanism (LCM) by Kohonen (see Kohonen’s book 2001, p.179). This LCM is one of the most striking differences to so-called artificial neural networks (ANN).
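For clarity, here is a minimal sketch of one such learning step, including the lateral control; the exponential decay schedules are common choices, not Kohonen’s exact formulation, and all names are ours:

```python
import numpy as np

# Minimal sketch of one SOM learning step with lateral control:
# weights: (n_nodes, dim) intensional profiles; grid: (n_nodes, 2) positions.
def som_step(weights, grid, x, t, t_max, eta0=0.5, sigma0=3.0):
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best matching node
    eta = eta0 * np.exp(-t / t_max)        # learning rate decays over time
    sigma = sigma0 * np.exp(-t / t_max)    # neighborhood radius shrinks too
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))     # influence decays with grid distance
    weights += eta * h[:, None] * (x - weights)
    return bmu
```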

It is now rather straightforward to think that the node keeps the index of the matching observation in its local memory. Over the course of learning, a node collects many records, which are all similar. This gathering of observations into an explicit collection is one of the MOST salient differences of our interpretation of the SOM to most of the standard interpretations! 

Figure 3: As Fig.2, showing the extensional container of one of the nodes.

The consequences are highly significant: the SOM is not a tool for visualization any more, it is a mechanism with inherent and nevertheless transparent abstraction! To be explicit: while we retain the full power of the SOM mechanism, we not only get an explicit clustering, but even the opportunity for a fully validated modeling, including a full description of the structure of the risk of mis-classification; hence there is no “black box” any more (in contrast, say, to ANN, or even statistical methods).

Now we can see what we gained from changing the description and dropping the unholy concept of the “input layer.” It now becomes clearly visible that nodes can be conceived of as containers, comprised of an extensional and an intensional part (as Carnap used the terms). The intensional part is what usually is called the weight vector of a node. The extensional part is the list of observations matching this intension.

The intensional part of a node thus represents a type. The extensional part of our revised SOM node represents the matching tokens.
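This container reading of a node can be made explicit in a few lines; the field names are ours, chosen for illustration:

```python
import numpy as np
from dataclasses import dataclass, field

# Sketch: a node as a container with an intensional part (the profile,
# usually called "weight vector") and an extensional part (the matching
# observations, kept as record indices).
@dataclass
class Node:
    profile: np.ndarray                              # intension: the type
    extension: list = field(default_factory=list)    # tokens: record indices

    def absorb(self, record_index, x, rate):
        self.extension.append(record_index)          # keep the reference
        self.profile += rate * (x - self.profile)    # update the intension
```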

But wait! As is usually done, we called the intensional part of the node the “weight vector”. Yet, this is a drastic misnomer. It does not contain “weights” of the variables. It is simply a value that can be calculated in different ways, and which is influenced from different sides. It is a function of

  • – the underlying extensional part = the list of records;
  • – the similarity functional that is used for this node;
  • – the general network dynamics;
  • – any kind of dynamic rule relating to the new observation.

It is thus much more adequate to talk about an “intensionality profile” than about weights. Of course, we can additionally introduce real “weights” for each of the positions in a structure profile vector.

A second important advantage of dropping this bad concept of the “input layer” is that we can localize the function that results in the actualization of the intensional part of the node. For instance, we can localize the similarity function. As part of the similarity function we could even consider implementing a dynamic rule (dependent on the extensional content of the node) that excludes certain positions = variables as arguments from the determination of the similarity!

The third important consequence is that we created a completely new compartment, the “extensional container” of a node. Using the concept of “input layer” this compartment is simply not visible. Thus, the concept of the input layer violates central insights from the theory of epistemic action.

This “extensional container” is not just a list of records. We can conceive of it as a “functional” compartment that allows for a great deal of new flexibility and dynamics. This inner dynamics could be used to create new elements of the intensional part of the node, e.g. about the variance of the tokens contained in the “extensionality container”, or about their relations as measured by the correlation. In fact, we could use any mechanism to create new positions in the intensional profile of a node, even the properties of an embedded SOM, a small population of artificial neurons, the result parameters of statistical functions taking the list of observations as input, and so on.
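As a sketch of such locally derived positions, here the variance and correlation examples just mentioned; everything beyond those two choices is our assumption:

```python
import numpy as np

# Sketch: derive new intensional entries from the extensional container,
# e.g. the per-variable variance of the collected tokens and their mutual
# correlations (records: list of observation vectors in this node).
def derived_intensional_entries(records):
    data = np.asarray(records)
    return {
        "variance": data.var(axis=0),       # spread of tokens per variable
        "correlation": np.corrcoef(data.T)  # relations between the variables
    }
```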

It is quite important to understand that the particular dynamics in the extensionality container is purely local. Notably the possibility for this dynamics also makes it possible to implement local differentiation of the SOM network, just as it is induced by the observations itself.

There is even a fourth implication of dropping the concept of the input layer, which led us to the separation between intensional and extensional aspects. This implication concerns the numerical production of the intensionality profile. Obviously, we can regard the transition from the extensional description to the intensional representation as an abstraction. This abstraction, as any, is accompanied by a loss of information. Referring to the collection of intensional representations means to use them as a model. It is now very important to recognize that there is no explicit down-stream connection to the observations any more. All we have at our disposal are intensional representations that emerged as a consequence of the interaction of three components: (1) the observations, (2) the quasi-material aspects of the modeling procedure (particularly the associative part of it, of course), and (3) the imposed target/risk settings.

As a consequence we have to care explicitly about the variance structure within the extensional containers. More precisely, the internal variance of the extensional containers has to be “comparable.” If we did not care about that, we could not consider the intensional representations as comparable: we simply would compare apples with oranges, since some of the intensional representations simply would represent “a large mess”. On the level of the intensionality profile one can’t see the variance anymore; hence we have to avoid the establishment of extensional groups (“micro-clusters”) that collect observations that are not “similar” with regard to their descriptional value vectors (inside the apriori given space of assignates). Astonishingly, this requirement of a homogenized extensional variance measure is overlooked even by Kohonen and his group, not to mention the implementations by countless epigonal fellows. It is clear that only the explicit distinction between the intensional and the extensional part of a model allows for the visibility of this important structural element.

Finally, and as a fifth consequence, we would like to emphasize that the explicit distinction between intensional and extensional parts opens the road towards a highly interesting region. We already mentioned that the transition from extensional description to intensional representation is a kind of abstraction. Yet, it is a simple kind of abstraction, closely tied to quasi-material aspects of the associative mechanism.

We may, however, easily derive the production of idealistic representations from that, if not to say “ideas” in the philosophical sense. To achieve that we just have to extend the SOM with a production facility, the capability to simulate. This is of course not a difficult task. We will describe the details elsewhere (an essay is scheduled); thus just a brief outline here. The “trick” is to use the intensional representations as seeds for generating surrogate observations by means of a Monte-Carlo simulation, such that the variance of the surrogates is a bit smaller than that of the empiric observations. Both the empiric and the surrogate “data” (nothing is “given” in the latter case) share the same space of assignates. The variance threshold can be derived dynamically from the SOM itself; it need not be predetermined at implementation time. As the next step one drops the extensional containers of the SOM and feeds the simulated data into it. After several loops of such self-referential modeling the intensional descriptions have “lost” their close ties to the empirical data, yet they are not completely unrelated. We still may use them as a kind of “template” in modeling, or for instance as a kind of null-model. In other words, the SOM contains the first traces of Platonic ideas.
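A rough sketch of the surrogate step; the shrink factor and the counts are assumptions, only the idea of slightly reduced variance comes from the outline above:

```python
import numpy as np

# Sketch: each intensional profile seeds surrogate observations whose
# standard deviation is slightly smaller than the empirical one of its
# extensional container (shrink < 1 is the only fixed requirement here).
def surrogate_observations(profiles, node_sd, n_per_node=50, shrink=0.9, seed=0):
    rng = np.random.default_rng(seed)
    return np.vstack([rng.normal(p, shrink * sd, size=(n_per_node, p.size))
                      for p, sd in zip(profiles, node_sd)])
```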

Modeling. What else?

Above we emphasized that the SOM provides the opportunity for a fully validated modeling if we distinguish explicitly intensional and extensional parts in the make-up of the nodes. The SOM is, however, a strange thing, that can act in completely different ways.

In the chapter about modeling we concluded that a model without a purpose is not a model, or at most a strongly deficient one. Nevertheless, many people claim to create models without implying a purpose to the learning SOM. They call it “unsupervised clustering”. This is, of course, nonsense. It should more appropriately be called “clustering with a deliberately hidden purpose,” since all the parameters of the SOM mechanisms and even the implementation act as constraints for the clustering, too. Any clustering mechanism applies a lot of criteria that influence the results. These constraints are supervised by the software, and the software has been produced by a human being (often called a programmer), so this human being is supervising the clustering with a long arm. For the same reason one cannot say that the SOM is learning something, nor that we would train the SOM, without giving it a purpose.

Though the digesting of information by a SOM without a purpose being present is neither modeling nor learning, what can we conceive such a process as then?

The answer is pretty simple, and remember, it becomes visible only after having dropped the illegitimate ascription of mistaken concepts. This clustering has a particular epistemological role:

Self-organizing Maps that are running without purpose (i.e. target variables) are best described as associative storage devices. Nothing more, but above all, also nothing less.

Actually, this has to be rated as one of the greatest currently unrecognized opportunities in the field of machine learning. The reason is again inadequate wording. Of course, the input for such a map should be probabilized (randomized), and it has already been demonstrated how to accomplish this… guess by whom… by Teuvo Kohonen himself, while he was inventing the so-called WebSom. Kohonen proposed random neighborhoods for presenting snippets of texts to the SOM, which are a simple version of random contexts.

Importantly, once one recognizes the categorical differences between target oriented modeling and associative storage, it becomes immediately clear that there are strictly different methodological, hence quasi-morphological requirements. Astonishingly, neither Kohonen himself nor any of his fellows recognized the conceptual difference between the two flavors. Kohonen used SOMs created without a target variable, i.e. without implying a purpose, as models for performing selections. Note that the principal mechanism of the SOM is the same for both approaches. There are just differences in the cost function(s) regarding the selection of variables.

There should be no doubt that any system intended to advance towards an autonomous machine-based episteme has to combine the two mechanisms. There are still other mechanisms, such as virtual movements, or virtual sequences in the abstract SOM space (we will describe that elsewhere), or the self-referential SOM for developing “crisp ideas”, but such a combination of associative storage and target oriented modeling is definitely inevitable (in our perspective… but we have strong arguments!).

SOM and Self-Organization

A small remark should be made here: Self-organizing maps are not in the same strong sense self-organizing as for instance Turing systems, or other Reaction-Diffusion Systems (RDS). A SOM gets organized by the interaction of its mechanisms and structures and the data. A SOM does not create patterns by it-SELF. Without feeding data into it, nothing happens, in stark contrast to self-organizing systems in the strong sense (see the example we already cited here), or take a look here from where we reproduced this parameter map for Gray-Scott Models.

Figure 4: The parameter map for Gray-Scott models, a particular Reaction-Diffusion System. Only for certain combinations of the two parameters of the system interesting patterns appear, and only for part of them the system remains dynamical, i.e. changing the layout of the patterns continuously.

As we discuss it in the chapter on complexity, it is pretty clear which kind of conditions must be at work to create the phenomenon of self-organization. None of them is present in Self-Organizing Maps; above all, SOMs are neither dissipative, nor are there antagonist influences.

Yet, it is not too difficult to create a self-organizing map that is really self-organizing. What is needed is either a second underlying process or inhibitory elements organized as population. In natural brains, we find both kinds of processes. The key for choosing the right starting point for implementing a system that is showing the transition from SOM to RDS is the complete probabilization of the idea of the network.

Our feeling is that at least one of them is mandatory in order to allow the system to develop logic as a category in an autonomous manner, i.e. not pre-programmed. As with any other understanding, the ability to think in logical terms, or to use logic as a category, should not be programmed into a computer. That ability should emerge from the implemented conditions. Our claim that some concept is quite the opposite of something else is quite likely based on such processes. It is highly indicative in this context that the brain indeed shows Turing patterns on the level of activity patterns, i.e. the patterns are not made of material entities, but are completely immaterial. Also, like chemical clocks such as the Belousov-Zhabotinsky system, another RDS, the natural brain shows a strong rhythmicity, both in its “local” activity patterns as well as in the overall activity, affecting billions of cells at a time.

So far, the strong self-organization is not implemented in our FluidSOM.

Spatial Layout Principles

The spatial layout principle is a very important design aspect. It concerns not only the actual geometrical arrangement of nodes, but also their mobility as representations of physical entities. In the case of SOM this has to be taken quite abstract. The “physical entities” represented by the nodes are not neurons. The nodes represent functional roles of populations of neurons.

Usually, the SOM is defined as a collection of nodes that are arranged in a particular topology. This topology may be

  • – grid like, 2-(3) dimensional;
  • – as kind of a swarm in 2 dimensions;
  • – as a gas, freely moving nodes.

The obvious difference between them is the degree of physical freedom for the nodes to move around. In grids, nodes are fixed and cannot move, while in the SOM gas the “nodes” are much more mobile.

There is also a quite important, yet not so obvious commonality between them. Firstly, in all of these layout principles the logical SOM nodes are identical with the “physical” items, i.e. representations of crossings in a grid, swarming entities, or gaseous containers. Thus, the data aspect of the nodes is not cleanly separated from their spatial behavior. If we separate them, the behavior of the nodes and the spatial aspects can be handled more transparently, i.e. the relevant parameters are better accessible.

Secondly, the space where those nodes are embedded is conceived as being completely neutral, as if those nodes would be arranged in deep space. Yet, everything we know of learning entities points to their mediality. In other words, the space that embeds the nodes should not be “empty”.

Using a Grid

In most of the cases the SOM is defined as a collection of nodes that are arranged as a regular grid (4(8)n, 6n). Think of it as a fixed network like a regular wire fence, or the atomic bonds in a model of a crystal.

This layout is by far the most abundant one, yet it is the most restricted one. It is almost impossible, at least very difficult to make such a SOM dynamic, e.g. to provide it the potential to grow or to differentiate.

The advantage of grids is that it is quite easy to calculate the geometrical distance between the nodes, which is a necessary step to determine the influence between any two nodes. If the nodes are mobile, this measurement requires much more effort in terms of implementation.

Using Metaphors for Mobility: Swarms, or Gases

Here, the nodes may range freely. Their movement is strongly influenced (or even restricted) by the moves of their neighbors. Yet, experience tells us that flocks of birds, or fishes, or bacteria do not learn efficiently on the level of the swarm. Structures are destroyed too easily. The same is true for the gas metaphor.

Flexible Phase in a Mediating Space

Our proposal is to render the “phase” flexible according to the requirements that are important in a particular stage of learning. The nodes may be strictly arranged like in a crystal, or quite mobile, they may move around according to physical forces or according to their informational properties like the gathered data.

Ideally, the crystalline phases and the fluid phases depend on just two or three parameters. One example for this is the “repulsive field”, a collection of items in a 2D space which repel each other. If the kinetic energy of those items is not too large, and the range of the repellent force is not too low, this automatically leads to a hexagonal pattern. Yet, the pattern is not programmed as an apriori pattern. It is a result of the properties of the items (and the embedding space). Such an emergent arrangement is never affected by something like a “layout defect.”

Inserting a new item or removing one is very easy in such a structure. More importantly, the overall characteristics of the system do not change despite the fact that the actual pattern changes.
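A toy sketch of such a repulsive field; all parameters are illustrative, and with a moderate step size and repulsion range the points settle into a roughly hexagonal packing:

```python
import numpy as np

# Toy sketch: items in the unit square repel each other within a short
# range; relaxation yields an approximately hexagonal arrangement.
def relax(points, steps=2000, radius=0.2, step=0.005):
    for _ in range(steps):
        diff = points[:, None, :] - points[None, :, :]       # pairwise offsets
        dist = np.linalg.norm(diff, axis=-1) + 1e-9
        mag = np.where(dist < radius, 1.0 / dist**2, 0.0)    # short-range repulsion
        np.fill_diagonal(mag, 0.0)                           # no self-interaction
        force = (mag / dist)[..., None] * diff               # directed forces
        points = np.clip(points + step * force.sum(axis=1), 0.0, 1.0)
    return points

nodes = relax(np.random.default_rng(0).random((100, 2)))
```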

The Collection of Items: “Nodes”

In the classic SOM, nodes serve a double purpose:

  • P1 – They serve as containers for references that point to records of data (=observations);
  • P2 – They present this extensional list in an integrated, “intensional” form;

The intensional form of the list is simply the weight vector of that node. In the course of learning, the list of the records contained in a particular node will be selected such that they are increasingly similar.

Note that keeping the references to the data records is extremely important. It is NOT part of most SOM implementations. If we did not do it, we could not use the SOM as a modeling tool at all. This might be the reason why most people use the SOM just as a visualization tool for data (which is a dramatic misunderstanding).

The nodes are not “directly” linked. Whether they influence each other or not is dependent on the distance between them and the neighborhood function. The neighborhood function determines the neighborhood, and it is a salient property of the SOM mechanism that this function changes over time. Important for our understanding of machine-based epistemology is that the relations between nodes in a SOM are potentially of a probabilistic character.

However, if we use a fixed grid, a fixed distance function, and a deterministically behaving neighborhood function, the resulting relations are not probabilistic any more.
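One way to re-introduce the probabilistic character, as a sketch; this is our illustration, not part of the classic SOM:

```python
import numpy as np

# Sketch: a probabilized neighborhood. Instead of a deterministic Gaussian
# kernel, each node takes part in the update with a probability derived
# from its grid distance to the best matching node.
def stochastic_neighborhood(grid, bmu, sigma, rng):
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
    p = np.exp(-d2 / (2 * sigma ** 2))       # participation probability
    return rng.random(p.size) < p            # boolean mask of updated nodes
```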

Also, in the case of the default SOM, the nodes are passive. They do not even perform the calculation of the weight vector, which is done by a central “update” loop in most implementations. In other words, in a standard SOM a node is a data structure. Here we arrive at a main point in our critique of the SOM:

The common concept of a SOM is equivalent to a structural constant.

What we need, however, is something completely different. Even on the level of the nodes we need entities that can change their structure and their relationality.

The concept of FluidSOM must be based on active nodes.

These active nodes are semi-autonomous. They calculate the weight vector themselves, based either on new input data or on some other “chemical” influences. They may develop a few long-range outgoing fibers, or masses of more or less stable (but not “fixed”!) input relations to other nodes. The active meta-nodes in a fluid self-organizing map may develop a nested mini-SOM, or they may incorporate any other mechanism for evaluating the data to which they point, e.g. a small neural network of fixed structure (see mn-SOM). Meta-nodes may also branch out further SOM instances locally into a relative “3D”, e.g. dependent on their work load, or again, on some “chemical” influences.

We see that meta-nodes are dynamic structures, something like a category of categories. This flexibility is indispensable for growth and differentiation.
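
To make the contrast to passive data structures concrete, here is a minimal sketch (in Java, all names ours) of an active node that owns a message queue and recalculates its weight vector itself; the rest of the system only posts messages to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of an "active" node: it owns a message queue and updates its weight
// vector itself, instead of being written to by a central update loop.
public class ActiveNodeSketch implements Runnable {

    private final BlockingQueue<double[]> inbox = new LinkedBlockingQueue<>();
    private final List<double[]> records = new ArrayList<>();
    private volatile double[] weights;

    ActiveNodeSketch(int dim) { this.weights = new double[dim]; }

    /** The rest of the system only posts messages; it never writes node state. */
    public void post(double[] record) { inbox.offer(record); }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                records.add(inbox.take()); // wait for data or "chemical" signals
                recalc();                  // the node updates itself
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void recalc() {
        double[] w = new double[weights.length];
        for (double[] r : records)
            for (int j = 0; j < w.length; j++)
                w[j] += r[j] / records.size();
        weights = w; // publish the new intensional form
    }
}
```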

This introduces the seed of autonomy on the lowest possible level. Here, within these almost material processes, it is barely autonomy; it is really a mechanical activity. Yet, this activity is NOT triggered by some reason anymore. It is just there, as a property of the matter itself.

We are convinced that the top-level behavioral autonomy is (at least in large parts) an emergent property that grows out of this a-reasonable activity on the micro-material level.

Data, Reflection

The profile vector of a SOM node usually contains, for all mutable variables (i.e. excluding ID and target variables, TV), the average of the values in the extensional list. That is, the profile vector itself does not know anything about target or index variables; that is solely the business of the node.
In our case, however, and based on the principle of “strict locality,” the weight vector may also contain a further section that refers to dynamic properties of the node, or of the data. We introduced this in a different way above when discussing the extensionality container of SOM nodes. Consider, for instance, the deviation of the data in the node from a model function (such as a correlation): such internal measurements cannot be predefined, and they are also not stable input data, since they are constantly changing (due to the list of data in the node, the state of other nodes, etc.).

This introduces the possibility of self-referentiality on the lowest possible level. Similar to the case of autonomy, we find the seed for the self-referentiality on the topmost level (call it consciousness…) in the midst of the material layer.

Programming Style

If there is one lesson we can draw from the studies of naturally occurring brains, then it is the fact that there is no master code between neurons, no “Mentalese.” The brain does not work on the basis of its own language. Equivalently, there are no logical circuits implementing a logical calculus. As a correlate we can say that the brain is not a thing that consists of a definite wiring. A brain is not a finite state automaton; it does not make any sense to ascribe states to brains. Instead, everything going on in a brain is probabilistic, even on the sub-cellular level. It is not determined in a definite manner how many vesicles have to burst in a synaptic gap to cause a transmission of the signal; it is not determined how many neurons exactly make up a working group for a particular “function”, etc. The only thing we can say is that certain fibers collect from certain “regions”, typically comprising millions of neurons, towards other such regions.

Note that any software program IS representable by just such a definite wiring. Hence, what we need is a mechanism that can transcend its own being as mechanism. We already discussed this issue in another chapter, where we identified abstract growth as a possible route to that achievement.

The processing of information in the brain is probabilistic, despite the fact that on the top level it “feels” different for us. Now, when starting to program artificial associative structures that are able to do similar things as a brain can accomplish, we have to respect this principle of probabilization.

We not only have to avoid hard-coded wiring between procedures; we have to avoid any explicit wiring at all. In terms of software architecture this translates into the proposal that we should not rely on object-oriented programming (OOP) alone. For instance, we would represent nodes in a SOM as objects, and the properties of these objects would again be other objects. OOP is an important, but certainly not a sufficient design element for a machine that shall develop its own episteme.

What we have to actualize in our implementation is not just OOP, but a messaging-based architecture, in which all elements are only loosely coupled. The Lateral Control Mechanism (LCM) of the Kohonen SOM is a nice example of this; the explicit wiring in ANN is the perfect counter-example, a DON’T DO IT. Yet, as we will see in the next section, the LCM should not be conceived as a symmetric and structurally constant functional entity!

Concerning programming style, on an even lower level this translates into the heavy use of so-called interfaces, as they are so prevalent in Java. Not objects are wired or passed around, but only interfaces. Interfaces are forward contracts about the standards for the interaction of different parts, which actually can change while the “program” is running.
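
A minimal illustration of this style (the interface and class names are our own, hypothetical choices): the parts know only contracts, and the concrete behavior behind a contract can be exchanged while the system is running.

```java
// Sketch: parts of the system interact only through interfaces, i.e. through
// contracts; the implementation behind a contract can be exchanged at runtime.
// All names are hypothetical.
interface SimilarityFunction {
    double similarity(double[] a, double[] b);
}

class EuclideanSimilarity implements SimilarityFunction {
    public double similarity(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return -Math.sqrt(s); // larger value = more similar
    }
}

class LooselyCoupledNode {
    private SimilarityFunction sim; // only the contract is known here

    LooselyCoupledNode(SimilarityFunction sim) { this.sim = sim; }

    /** The behavior behind the interface may be swapped while running. */
    void rebind(SimilarityFunction newSim) { this.sim = newSim; }

    double match(double[] weights, double[] record) {
        return sim.similarity(weights, record);
    }
}
```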

Of course, these considerations concern only the lowest, indeed material levels of an associative system; yet, they are necessary. If we start with wires of any kind, we won’t achieve our goals. From the philosophical perspective it does not come as a surprise that the immanence of autonomous abstraction is to be found only in open processes, which include the dimension of mediality. Even in the interaction of its tiniest parts the system should not rely on definite encodings.

Functional Differentiation

During their development, natural systems differentiate in their parts. Bodies are comprised of organs, organs are made of different cell types, and within all members of a cell type a further differentiation of their actual and context-specific roles may occur. The same can be observed in social insects, or in any other group of social beings. They are morphologically almost identical, yet their experience lets them do their tasks differentially, or even lets them do different tasks. Why then should we assume that all neurons in a large compound should act absolutely equally?

To illustrate the point we should visit a particular African termite species (Schedorhinotermes lamanianus) on which I worked as a young biologist. They feed on rotten wood. Since these pieces of wood are much larger than the termites, a problem occurs: the animals have to organize their collective foraging, i.e. where to stay and gnaw at the wood, and where to travel in order to return the harvested pieces back to the home nest, where they put them into a processing chamber stuffed with a special kind of fungus. The termites then actually feed on that fungus, and mostly not on the wood (though they also have bacteria in their gut to do the job of digesting the cellulose and the lignin).

Important for us is the foraging process. To organize gnawing sites and traveling routes they use pheromones, and, no wonder, they use just 2 for that, which together build a Turing system, as I proved with a small bio-test together with a colleague.

In the nervous system of animals we find a similar problematics. The brain is not just a large network, symmetric through and through like a crystal. Of course not. There are compartments (see our chapter about complexity), there are fibers. The various parts of the brain even differ strongly with respect to their topological structure, their “wiring.” Why the heck should an artificial system look like a perfect crystal? In a crystal there will be no stable emergence, hence no structural learning. By the way, we should not expect structural learning in swarms either, for a very similar reason, albeit that reason instantiates in the opposite manner: complete perturbation prevents the emergence of compartments, too, hence no structural learning will be observed. (That’s the reason why we do not have swarms in the skull…)

Back to our neurons. We reject the approach of a direct representational simulation of neurons, or of parts of the brain. Instead we propose to focus on the principles as elements of construction. Any system that is intended to show structural learning is in urgent need of the basic differentiation into “local” and “tele” (among others). Here we meet even a structural parallelism to large urban compounds.

We can implement the emergence of such fibers in a straightforward manner if we make it dependent on the occurrence of reproducible, repeated co-excitation of regions. This implies that we have to soften the “winner-takes-all” principle of the SOM. At least in large networks, any given observation should possibly leave its trace in different regions. Yet, our experience with very large maps indicates that this may happen almost inevitably. We used very simple observations consisting of only 3 features (r, g, and b, forming the RGB color triplet) and a large SOM consisting of around 1’000’000 nodes. The topology was 4n, and the map was placed on a torus (no borders). After approx. 200’000 observations, the uniqueness of the color concepts started to erode: for some colors, two conceptual regions appeared.
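
Softening the winner-takes-all principle can be sketched quite simply: instead of a single best matching node, the k best nodes receive (a trace of) the observation. The following Java fragment is an illustrative sketch of that idea, not the procedure actually used in the experiment described above.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a softened "winner takes all": the k best matching nodes all
// receive the observation, so a single observation may leave traces in
// several, possibly distant regions of the map. Names are illustrative.
public class SoftWinnerSketch {

    /** Indices of the k nodes whose weight vectors are closest to the record. */
    static int[] bestK(double[][] nodeWeights, double[] record, int k) {
        Integer[] idx = new Integer[nodeWeights.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(nodeWeights[i], record)));
        int[] winners = new int[k];
        for (int i = 0; i < k; i++) winners[i] = idx[i];
        return winners;
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```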

In the further development of such SOMs it is quite natural to let fibers grow between such regions, changing the topology of the SOM from that of a crystal to that of a brain. While the former is almost perfectly isotropic in exactly 3 dimensions, the topology of the brain is (due to the functional differentiation into tele-fibres) highly anisotropic, in a high and variable dimensionality.

Conclusion

Here we discussed some basic design issues about self-organizing maps and introduced some improvements. We have seen that wording matters when it comes to representing even a mechanism. The issues we touched have been:

  • – explicit distinction of intensionality and extensionality in the conceptualization of the SOM mechanism, leading to a whole “new” domain of SOM architectures;
  • – producing idealistic representations from a collection of extensional descriptions;
  • – dynamics in the extensionality domain, including embedding of other structures, thus proceeding to the principle of compartmentalization, functional differentiation and morphological growth;
  • – the distinction between modeling and associative storage, which require different morphological structures once they are distinguished;
  • – stuffing the SOM with self-organization in the strong sense;
  • – spatial layout, fixed grid versus the emergent patterns in a repulsion field of freely moving particles; distinguishing material particles from functional abstract nodes;
  • – nodes as active components of the grid;
  • – self-referentiality on the microscopic level that gives rise to emergent self-referentiality on the macroscopic level;
  • – programming style, which should not only be as abstract (and thus as general) as possible, but also has to proceed from the strictly defined, strongly coupled object-oriented style to a loosely coupled system based on messaging, even on the lowest levels of implementation, e.g. the interaction of nodes;
  • – functional differentiation of nodes, leading to dynamic, fractional dimensionality and topological anisotropy;

Yet, there are still many more aspects that have to be considered if one tries to approach processes on a machinic substrate that could give rise to what we call “thinking.” In discussing the design issues listed above, we remained quite on the material level. But of course, morphology is important. Nevertheless we should not conceive of morphology as a perfect instance of a blueprint; it is more about the potential, if not to say the “virtuality”, that is implied as immanence by the morphology. Beyond that morphology, we have to design the processes of dynamic change of that morphology, which we usually call growth, or tissue differentiation. Even on top of that, we have to think about the informational, i.e. immaterial processes that only eventually lead to morphological correlates.

Anyway, when thinking about machine-based episteme, we obviously have to forget about crystals and swarms, about perfectness and symmetry in morphological structures. Instead, all of the issues, whether material or immaterial, should be designed with the perspective towards an immanence of virtuality in mind, based on probabilized mechanisms.

In a further chapter (scheduled) we will try to approach in more detail two other design issues about the implementation of an advanced Self-organizing Map that we already mentioned briefly here, again oriented at basic abstract elements and the principles found in natural brains: inhibitory processes and probabilistic negation on the one hand, and the chemical milieu on the other. Above we already indicated that we expect a continuum between Self-organizing Maps and Reaction-Diffusion Systems, which in our perspective is highly significant for the working of brains, whether natural or artificial ones.

۞

FluidSOM (Software)

January 25, 2012 § 7 Comments

The FluidSOM is a modular component of a SOM population that is suitable to follow the “Growth & Differentiate” paradigm.

Self-Organizing Maps (SOM) are usually established on fixed grids, using a 4n or 6n topology. Implementations as swarms or gases are quite rare, and they are also burdened with particular problems. After all, we don’t have “swarms” or “gases” in our heads (at least most of us, for most of the time…). This remains true even if we consider only the informational part of the brain.

The fixed grid prohibits a “natural” growth or differentiation of the SOM layer. Actually, this impossibility to differentiate also renders structural learning impossible. If we consider “learning” as something that is different from the mere adjustment of already available internal parameters, then we could say that the inability to differentiate morphologically also means that there is no true learning at all.

These limitations, among others, are overcome by our FluidSOM. Instead of a fixed grid, we use a quasi-crystalline fluid of particles. This makes it very easy to add or remove, to merge or to split “nodes”. The quasi-grid will always take a state of minimized tensions (at least after shaking it a bit…)

As said, the particles of the collection may move around “freely”; there is no grid to which they are bound a priori. Yet, the population will arrange itself in an almost hexagonal arrangement, if certain conditions hold:

  • – The number of particles fits the dimensions of the available surface area.
  • – The particles are fully symmetric across the population regarding their properties.
  • – The parameters for mobility and repellent forces are suitably chosen.

Deviations from a perfect hexagonal arrangement are thus quite frequent. Sometimes hexagons enclose an empty position, or pentagons establish themselves instead of hexagons, frequently near the border or immediately after a change in the collection (adding/removing a particle). This, however, is not a drawback at all, especially not in the case of SOM layers that are relatively large (starting with N>~500). In really large layers comprising >100’000 nodes, the effect is negligible. The advantage of such symmetry breaks on the geometrical level, i.e. on the quasi-material level, is that they provide a starting point for a natural pathway of differentiation.

There is yet another advantage: the fluid layer contains particles that are not necessarily identical to the nodes of the SOM, and the relations between nodes are not bound to the hosting grid either.

The RepulsionField class allows for a confined space or for a borderless topology (a torus), the second of which is often more suitable to run a SOM.

Given all these advantages, the question arises why fixed grids are so dramatically preferred over fluid layouts. The answer is simple: it is not simple at all to implement fluid layouts in a way that allows for a fast and constant query time for neighborhoods. If it took 100ms to determine the neighborhood of a particular location in a large SOM layer, it would not be possible to run such a construct as a SOM at all: the waiting time would be prohibitive. Our Repulsion Field addresses this problem with buffering, such that it is almost as fast as the neighborhood query in fixed grids.
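
We do not detail the actual buffering scheme of the RepulsionField here; as an assumption-laden sketch, one common way to get near-constant query times for mobile particles is a bucketed spatial index that is rebuilt only after movements, not per query:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: near-constant-time neighborhood queries for mobile particles via a
// bucketed spatial index. The index acts as a buffer: it is rebuilt only after
// particles have moved, not for every query. The actual RepulsionField may use
// a different scheme; this is an illustrative assumption.
public class BucketIndexSketch {

    final double cell; // bucket edge length, chosen ~ the query radius
    final Map<Long, List<double[]>> buckets = new HashMap<>();

    BucketIndexSketch(double cell) { this.cell = cell; }

    /** Rebuild the buffer; particles are given as {x, y} pairs. */
    void rebuild(List<double[]> particles) {
        buckets.clear();
        for (double[] p : particles)
            buckets.computeIfAbsent(key(p[0], p[1]), k -> new ArrayList<>()).add(p);
    }

    /** Candidate neighbors: at most the 9 buckets around (x, y). */
    List<double[]> neighbors(double x, double y) {
        List<double[]> out = new ArrayList<>();
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++) {
                List<double[]> b = buckets.get(key(x + dx * cell, y + dy * cell));
                if (b != null) out.addAll(b);
            }
        return out; // the caller still filters by exact distance
    }

    private long key(double x, double y) {
        long cx = (long) Math.floor(x / cell), cy = (long) Math.floor(y / cell);
        return (cx << 32) | (cy & 0xffffffffL); // unique for 32-bit cell indices
    }
}
```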

So far, only the RepulsionField class is available, but the completed FluidSOM should follow soon.

The Repulsion Field of the FluidSOM is available through the Google project hosting in noolabfluidsom.

The following four screenshot images show four different selection regimes for the dynamic hexagonal grid:

  • – single node selection, here as an arbitrary group
  • – minimal spanning tree on this disjoint set of nodes
  • – convex hull on the same set
  • – conventional patch selection as it occurs in the learning phase of a SOM

As I already said, those particles may move around such that the total energy of the field gets minimized. Splitting a node, as a metaphor for natural growth, leads to a different layout, yet in a very smooth manner.

Fig 1a-d: The Repulsion Field used in FluidSOM.
Four different modes of selection are demonstrated.

To summarize, the change to the fluidic architecture comprises

  • – possibility for a separation of physical particles and logical node components
  • – possibility for dynamic seamless growth or differentiation of the SOM lattice, including the mobility of the “particles” that act as node containers;

Besides that, FluidSOM offers a second major advance compared to the common SOM concept. It concerns the concept of the nodes. In FluidSOM, nodes are active entities, stuffed with a partial autonomy. Nodes are not just passive data structures; they won’t get “updated” by a central mechanism. In salient contrast, they maintain certain states comprising activity and connectivity, as well as their particular selection of a similarity function. Only in the beginning are all nodes equal with respect to those structural parameters. As a consequence of these properties, nodes in FluidSOM are able to outgrow (pullulate) new additional instances of FluidSOM as a kind of offspring.

These two advances remove many limitations of the common concept of the SOM (for more details see here).

There is a last small improvement to introduce. In the snapshots shown above you may detect some “defects,” often as either holes within a perfect hexagon, or sometimes also as a pentagon. But generally the layout looks quite regular. Yet, this regularity is again more similar to crystals than to living tissue. We should not take the irregularity of living tissue as a deficiency. In nature there are indeed highly regular morphological structures, e.g. in the retina of vertebrate eyes, or the faceted eyes of insects. In some parts (motor brains) of some brains (especially in birds) we can find quite regular structures. There is no reason to assume that evolutionary processes could not lead to regular cellular structures. Yet, we will never find “crystals” in any kind of brain, not even in insects.

Taking this as advice, we should introduce a random factor into the basic settings of the particles, such that the emerging pattern will not be regular anymore. The repulsion principle still leads to a locally stable configuration, though. Yet, strong re-arrangement flows are not excluded either. The following figure shows the resulting layout for a random variation (within certain limits) of the repellent force.

Figure 2: The Repulsion Field of FluidSOM, in which the particles are individually parameterized with regard to the repellent force. This leads to significant deviations from the hexagonal symmetry.

This broken symmetry is based on a local individuality with regard to the repellent force attached to a particle. Albeit this individuality is only local and of a rather weak character, together with the symmetry break it induces, it is nevertheless important as a seed for differentiation. It is easy to imagine that the repellent forces are some (random) function of the content-related role of the nodes that are transported by the particles. For instance, large particles could decrease or increase this repellent force, leading to particular morphological correlates of the semantic activity of the nodes in a FluidSOM.
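
As a sketch of this idea (names and the modulation formula are illustrative assumptions only): each particle carries its own repellent force, seeded randomly within limits and optionally modulated by the activity of the hosted node.

```java
import java.util.Random;

// Sketch: each particle carries an individual repellent force, seeded randomly
// within limits and optionally modulated by the activity of the hosted node.
// The names and the modulation formula are illustrative assumptions.
public class IndividualParticleSketch {

    double x, y;
    double repellentForce;

    IndividualParticleSketch(double x, double y, Random rnd) {
        this.x = x;
        this.y = y;
        // base force, randomly varied within limits: breaks the hexagonal symmetry
        this.repellentForce = 0.002 * (0.5 + rnd.nextDouble());
    }

    /** Couple the force to the content-related role of the node, e.g. its load. */
    void modulate(double nodeActivity) {
        repellentForce *= (1.0 + 0.5 * nodeActivity);
    }
}
```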

A further important property for determining the neighborhood of a particle is directionality. The RepulsionField supports this selection mode, too. It is, however, completely controlled on the level of the nodes. Hence we will discuss it there.

Here you may directly download a zip archive containing a runnable file demonstrating the repulsion field (sorry for the size (6 Mb); it is not optimized for the web). Please note that you have to install Java first (on Windows). Also, I recommend reading the file “readme.txt”, which explains the available commands.

Machine-based Episteme/Epistemology

October 24, 2011 § Leave a comment

It is pretty clear that even if we just think of machines being able to understand, this immediately triggers epistemological issues. If such a machine were able to build previously non-existent representations of the outer world in a non-deterministic manner, a wide range of epistemological implications would be invoked for that machine-being, and these epistemological implications are largely the same as for humans or other cognitively well-developed organic life.

Epistemology investigates the conditions for “knowledge” and the philosophical consequences of knowing. “Knowledge” is notoriously difficult to define, and there are many misunderstandings around, including the “soft stone” of so-called “tacit knowledge”; yet for us it simply denotes a bundle consisting of

  • – a dynamic memory
  • – capacity for associative modeling, i.e. adaptively deriving rules about the world
  • – ability to act upon achieved models and memory
  • – self-oriented activity regarding the available knowledge
  • – already present information is used to extend the capabilities above

Note that we do not demand communication about knowledge. For several reasons, based also on Wittgenstein’s theory of meaning, we think that knowledge cannot be transmitted or communicated. Recently, Linda Zagzebski [1] arrived at the same result, starting from a different perspective. She writes that “[…] knowledge is not an output of an agent, instead it is a feature of the agent.” In agreement with at least some reasonably justified philosophical positions we thus propose that it is also reasonable to conceive of such a machine as mentioned before as being enabled to know. Accordingly, it is indicated to assign the capability for knowing to the machine. The knowledge comprised or constituted by the machine is not accessible for us as the “creators” of the machine, for the very same reason of the difference of the Lebenswelten.

Yet, the knowledge acquired by the machine is also not “directly” accessible for the machine itself. In contrast to rationalist positions, knowledge can’t be separated from the whole of a cognitive entity. The only thing that is possible is to translate it into publicly available media like language, to negotiate a common usage of words and their associated link structures, and to debate about the mutually private experiences.

Resorting to the software running on the machine and checking the content of the machine will not be possible either. A software that enables knowing can’t be decomposed in order to serve as an explanation of that knowledge. The only thing one will find is a distant analog to our neurons. Reductionism will work for the machine as little as it works for the human mind.

Yet, such machine-knowledge is not comparable to human knowledge. The reason for that is not an issue of type, or extent. The reason is given by the fact that the Lebenswelt of the machine, that is, the totality of all relations to the outer world and of all transformations of the perceiving and acting entity, the machine, would be completely different from ours. It will not make any sense to try to simulate any kind of human-like knowledge in that machine. It will always be drastically different.

The only possibility to speak about the knowing and the knowledge of the machine is through epistemological concepts. For us it doesn’t seem promising to engage in fields like “Cognitive Informatics,” since informatics (computer science) cannot deal with cognition, for rather fundamental reasons: cognition is not Turing-computable.

The bridging bracket between the brains and minds of machine and human being is the theory of knowing. Consequently, we have to apply epistemology to deal with machines that possibly know. The conditions for that knowledge could turn out to be strange; hence we should try to develop the theory of machine-based knowledge from the perspective of the machine. It is important to understand that attempts like the Turing Test [2] are inappropriate, for several reasons: (i) they follow the behavioristic paradigm, (ii) they do not offer the possibility to derive scales for comparison, and (iii) no fruitful questions can be derived from them.

Additionally, there are some arguments pointing to the implicit instantiation of a theory as soon as something is going to be modeled. In other words, a machine which is able to know already has a—probably implicit—theory about it, and this also means about itself. That theory would originate in the machine (despite the fact that it can’t be a private theory). Hence, we open a branch and call it machine-based epistemology.

Some Historical Traces of ‘Contacts’

(between two strange disciplines)

Regarding the research about and the construction of “intelligent” machines, the relevance of thinking in epistemological terms was recognized quite early. In 1963, A. Wallace published a paper entitled “Epistemological Foundations of Machine Intelligence” [3] that, quite unfortunately, is not available except for the already remarkable abstract:

Abstract : A conceptual formulation of the Epistemological Foundations of Machine Intelligence is presented which is synthesized from the principles of physical and biological interaction theory on the one hand and the principles of mathematical group theory on the other. This synthesis, representing a fusion of classical ontology and epistemology, is generally called Scientific Epistemology to distinguish it from Classical General Systems theory. The resulting view of knowledge and intelligence is therefore hierarchical, evolutionary, ecological, and structural in character, and consequently exhibits substantial agreement with the latest developments in quantum physics, foundations of mathematics, general systems theory, bio-ecology, psychology, and bionics. The conceptual formulation is implemented by means of a nested sequence of structural Epistemological-Ontological Diagrams which approximate a strong global interaction description. The mathematico-physical structure is generalized from principles of duality and impotence, and the techniques of Lie Algebra and Lie Continuous Group theory.

As far as it is possible to get an impression of the actual but lost full paper, Wallace’s approach is formal and mathematical. Biological interaction theory at that time was a fork of mathematical information theory, at least in the U.S., where this paper originated. Another small weakness could be indicated by the notion of “hierarchical knowledge and intelligence,” pointing to some remnant of positivism. Anyway, the proposed approach was never followed upon, unfortunately. Yet we will see in our considerations about modeling that the reference to structures like Lie Group theory could not have worked out in a satisfying manner.

Another early instance of bringing epistemology into the research about “artificial intelligence” is McCarthy [4,5], who coined the term “Artificial Intelligence.” Yet, his perspective appears by far too limited. First he starts with the reduction of epistemology to first-order logic:

“We have found first order logic to provide suitable languages for expressing facts about the world for epistemological research.” […]

“Philosophers emphasize what is potentially knowable with maximal opportunities to observe and compute, whereas AI must take into account what is knowable with available observational and computational facilities.”

Astonishingly, he does not mention any philosophical argument in the rest of the paper, except in the last paragraph:

“More generally, we can imagine a metaphilosophy that has the same relation to philosophy that metamathematics has to mathematics. Metaphilosophy would study mathematical systems consisting of an ‘epistemologist’ seeking knowledge in accordance with the epistemology to be tested and interacting with a ‘world’. […] AI could benefit from building some very simple systems of this kind, and so might philosophy.”

McCarthy’s stance toward philosophy is typical for the whole field. Besides the presumptuous suggestion of a “metaphilosophy” and subsuming it rather nonchalantly under mathematics, he misses the point of epistemology, even as he refers to the machine as an “observer”: a theory of knowledge is about the conditions of the possibility for knowledge. McCarthy does not care about the implications of his moves for that possibility, or vice versa.

Important progress on the issue of the state of machines was contributed not by the machine technologists themselves, but by philosophers, namely Putnam, Fodor, Searle and Dennett in the English-speaking world, and also by French philosophers like Serres (in his “Hermes” series) and Guattari. The German systems theorists like von Foerster and Luhmann and their fellows never went beyond cybernetics, so we can omit them here. In 1998, Wellner [6] provided a proposal for epistemology in the field of “Artificial Life” (what a terrible wording…). Yet, his attempt to contribute to the epistemological discussion turns out to be inspired by Luhmann’s perspective, and the “first step” he proposes is simply to stuff robots with sensors; so in the end it is not really a valuable attempt to deal with epistemology in affairs of epistemic machines.

In 1978, Daniel Dennett [7] reframed the so-called “Frame Problem” of AI, of which McCarthy and Hayes [4] had already become aware 10 years earlier. Dennett asks how

“a cognitive creature … with many beliefs about the world” can update those beliefs when it performs an act so that they remain “roughly faithful to the world”? (cited acc. to [8])

Recently, Dreyfus [9] and Wheeler [10], who however disagrees with Dreyfus about the reasoning, called the Frame Problem an illusionary pseudo-problem, created by the adherence to Cartesian assumptions. Wheeler described it as:

“The frame problem is the difficulty of explaining how non-magical systems think and act in ways that are adaptively sensitive to context-dependent relevance.”

Wheeler as well as Dreyfus recognize the basic problem(s) in the architecture of mainstream AI, and they identify Cartesianism as the underlying principle of these difficulties, i.e. the claim of analyticity, reducibility and identifiability. Yet, neither of the two has so far proposed a stable solution. Heideggerian philosophy, with its situationistic appeal, does not help to clarify the epistemological affairs, neither of machines nor of humans.

Our suggestion is the following: Firstly, a general solution should be found for how to conceive the (semi-)empirical relationship between beings that have some kind of empirical coating. Secondly, this general solution should serve as a basis to investigate the differences, if there are any, between machines and humans regarding their epistemological affairs with the “external” world. This endeavor we label “machine-based epistemology.”

Machine-based Epistemology

If a machine, or better, a synthetic body that was established as a machine in the moment of its instantiation, were able to act freely, it would face the same epistemological problems as we humans do, starting with basic sensory perception and not ending with linking a multi-modal integration of sensory input to adequate actions. Therefore machine-based epistemology (MBE) is the appropriate label for the research program that is dedicated to learning processes implemented on machines. We avoid invoking the concept of agents here, since this already brings in a lot of assumptions.

Note that MBE should not be confused with so-called “Computer Epistemology,” which is concerned just with the design of so-called man-machine interfaces [11]. We are not concerned with epistemological issues arising through the usage of computers, of course.

It is clear that the term machine learning misses the point; it is a purely technical term. Machine learning is about algorithms and programmable procedures, not about the reflection of the conditions of that. Thus, it does not recognize the context into which learning machines are embedded, and in turn it also misses the consequences. In some way, machine learning is not about learning about machines. It remains a pure engineering discipline.

As a consequence, one can find a lot of nonsense in the field of machine learning, especially concerning so-called ontologies and meta-data, but also about the topic of “learning” itself. There is the nonsensical term of “reinforcement learning”: which kind of learning would not be about (differential) reinforcement?

The other label Machine-based Epistemology is competing with is “Artificial Intelligence.” Check out the editorial text “Where is the Limit” for arguments against the label “AI.” The conclusion was that AI is too close to cybernetics and mathematical information theory, that it is infected by romanticism and difficult to operationalize, and that it does not appropriately account for cultural effects on the “learning subject.” Since AI is not natively connected to philosophy, there is no adequate treatment of language: AI never took the “Linguistic Turn.” Instead, the so-called philosophy of AI poses silly questions about “mental states.”

MBE is concerned with the theory of machines that possibly start to develop autonomous cognitive activity; you may call this “thinking.” You may also conceive it as a part of a “philosophy of mind.” Both notions, thinking and mind, may work in the pragmatics of everyday social situations; for a stricter investigation I think they are counter-productive: we should pay attention to language in order not to get vexed by it. If there is no “philosophy of unicorns,” then probably there also should not be a “philosophy of mind.” Both labels, thinking and mind, pretend to define a real and identifiable entity, albeit exactly this should be one of the targets of a clarification. Those labels can easily cause the misunderstanding of separable subjects. Instead, we could call it “philosophy of generalized mindfulness”, in order to avoid anthropomorphic chauvinism.

As a theory, MBE is not driven by engineering, as is the case for AI; just the other way round, MBE itself drives engineering. It brings philosophical epistemology into the domain of engineering computer systems that are able to learn. As such it is natively linked, in an already well-established manner, to other fields in philosophy. Which, finally, helps to avoid posing silly questions or following silly routes.

  • [1] Linda Zagzebski, contribution to: Jonathan Dancy, Ernest Sosa, Matthias Steup (eds.), “A Companion to Epistemology”, Vol. 4, pp.210; here p.212.
  • [2] Alan Turing (1950), Computing machinery and intelligence. Mind, 59(236): 433-460.
  • [3] Wallace, A. (1963), Epistemological Foundations of Machine Intelligence. Information for the Defense Community (U.S.A.), Accession Number AD0681147.
  • [4] McCarthy, J. and Hayes, P.J. (1969) Some Philosophical Problems from the Standpoint of Artificial Intelligence. Machine Intelligence 4, pp.463-502 (eds Meltzer, B. and Michie, D.). Edinburgh University Press.
  • [5] McCarthy, J. 1977. Epistemological problems of artificial intelligence. In IJCAI, 1038-1044.
  • [6] Jörg Wellner (1998), Machine Epistemology for Artificial Life. In: “Third German Workshop on Artificial Life”, edited by C. Wilke, S. Altmeyer, and T. Martinetz, pp. 225-238, Verlag Harri Deutsch.
  • [7] Dennett, D. (1978), Brainstorms, MIT Press, p.128.
  • [8] Murray Shanahan (2004, rev.2009), The Frame Problem, Stanford Encyclopedia of Philosophy, available online.
  • [9] H.L. Dreyfus, (2008), “Why Heideggerian AI Failed and How Fixing It Would Require Making It More Heideggerian”, in The Mechanical Mind in History, eds. P.Husbands, O.Holland & M.Wheeler, MIT Press, pp. 331–371.
  • [10] Michael Wheeler (2008), Cognition in Context: Phenomenology, Situated Robotics and the Frame Problem. Int.J.Phil.Stud. 16(3), 323-349.
  • [11] Tibor Vamos, Computer Epistemology: A Treatise in the Feasibility of the Unfeasible or Old Ideas Brewed New. World Scientific Pub, 1991.

۞

The Self-Organizing Map – an Intro

October 20, 2011 § Leave a comment

A Map that organizes itself:

Is it the paradise for navigators, or is it the hell?

Well, it depends, I would say. As a control freak, or a warfarer like Shannon in the early 1940s, you would probably vote for the hell. And indeed, there are presumably only very few entities that have been so strongly neglected by information scientists as the self-organizing map. Of course, there are some reasons for that. The other type of navigator, the one probably enjoying the SOM, is more likely of the type of Odysseus, or Baudolino, the hero of Umberto Eco’s novel of the same name.

More seriously, the self-organizing map (SOM) is a powerful and, even today (2011), still underestimated structure, though it is meanwhile rapidly gaining popularity. This chapter serves as a basic introduction to the SOM, along with a first discussion of the strengths and weaknesses of its original version. Today, there are many versions around, mainly in research; the most important ones I will briefly mention at the end. It should be clear that there are tons of articles around on the web. Yet, most of them focus on the mathematics of the more original versions, but do not describe or discuss the architecture itself, or provide a suitable high-level interpretation of what is going on in a SOM. So, I will not repeat the mathematics; instead I will try to explain it also for non-engineers, without using mathematical formulas. Actually, the mathematics is not the most serious thing in it anyway.

Brief

The SOM is a bundle comprising a mathematical structure and a particularly designed procedure for feeding multi-dimensional (multi-attribute) data into it, prepared as a table. The number of attributes can reach tens of thousands. Its purpose is to infer the best possible sorting of the data in a 2(3)-dimensional grid. Yet, both preconditions, dimensionality and data as table, are not absolute and may be overcome by certain extensions to the original version of the SOM. The sorting process groups more similar records closer together. Thus we can say that a body of high-dimensional data (organized as records from a table) is mapped onto 2 dimensions, thereby enforcing a weighting of the properties used to describe (essentially: to create) the records.

The SOM can be parametrized such that it is a very robust method for clustering data. The SOM exhibits an interesting duality, as it can be used for basic clustering as well as for target-oriented predictive modeling. This duality opens interesting possibilities for realizing a pre-specific associative storage. The SOM is particularly interesting due to its structure and hence due to its extensibility, properties that most other methods do not share with the SOM. Though substantially different from other popular structures like Artificial Neural Networks, the SOM may be included in the family of connectionist models.

History

The development leading finally to the SOM started around 1973 in a computer science lab at Helsinki University. It was Teuvo Kohonen who became aware of certain memory effects of correlation matrices. Until 1979, when he first published the principle of the Self-Organizing Map, he dedicatedly adopted principles known from the human brain. A few further papers followed, and a book about the subject in 1983. Then, the SOM wasn’t readily adopted for at least 15 years. Its direct competitor for acceptance, the Backpropagation Artificial Neural Network (B/ANN), was published in 1985, after neural networks had been rediscovered in physics, following investigations of spin glasses and certain memory effects there. Actually, the interest in simulating neural networks dates back to 1941, when von Neumann, Hebb, McCulloch, and also Pitts, among others, met at a conference on the topic.

For a long time the SOM wasn’t regarded as a “neural network,” and this was considered a serious disadvantage. The first part of the diagnosis was indeed true: Kohonen never tried to simulate individual neurons, as was the goal of all simulations of ANN. The ANN research has been deeply informed by physics, cybernetics and mathematical information theory. Simulating neurons is simply not adequate; it is kind of silly science. Above all, most ANN are just a very particular type of “network,” as there are no connections within a particular layer. In contrast, Kohonen tried to grasp a more abstract level: the population of neurons. In our opinion this choice is much more feasible and much more powerful as well. In particular, the SOM can represent not only “neurons,” but any population of entities which can exchange and store information. More about that in a moment.

Nowadays, the methodology of the SOM can be rated as well adopted. More than 8’000 research papers have been published so far, with increasing momentum, covering a lot of practical domains and research areas. Many of them have demonstrated the superiority or greater generality of the SOM as compared to other methods.

Mechanism

The mechanism of a basic SOM is quite easy to describe, since there are only a few ingredients.
First, we need data. Imagine a table where observations are listed in rows, and the column headers describe the variables that have been measured for each observed case. The variables are also called attributes, or features. Note that in the case of the basic (say, the Kohonen) SOM the structure given by the attributes is the same for all records. Technically, the data have to be normalized per column such that the minimum value is 0 and the maximum value is 1 (a sketch of this step follows below). Note that this normalization ensures the comparability of different sub-sets of observations. It represents just the most basic transformation of data, while many other transformations are possible: logarithmic re-scaling of the values of a column in order to shift the mode of the empirical distribution, splitting a variable by value, binarization, scaling of parameters that are available only on a nominal niveau, or combining two or several columns by a formula are further examples (for details please visit the chapter about modeling). In fact, the transformation of data (I am not talking here about the preparation of data!) is one of the most important ingredients for successful predictive modeling.
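
The basic normalization step itself is trivial; here is a sketch in Java (illustrative only, without the handling of missing values that a robust implementation would need):

```java
// Sketch of the basic column-wise min-max normalization to [0,1]. Illustrative
// only; a robust implementation would also handle missing values.
public class NormalizeSketch {

    static void minMax(double[][] table) {
        int cols = table[0].length;
        for (int j = 0; j < cols; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] row : table) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double span = max - min;
            for (double[] row : table)
                row[j] = (span == 0.0) ? 0.0 : (row[j] - min) / span;
        }
    }
}
```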

Second, we create the SOM. Basically, and in its simplest form, the SOM is a grid, where each cell has 4 edges (squares, rectangles) or 6 edges (hexagonal layout). The grid consists of nodes and edges. Nodes serve as a kind of container, while edges work as a kind of fiber for spreading signals. In some versions of the SOM the nodes can range freely, or they can randomly move around a little bit.

An important element of the architecture of a SOM is that each node gets assigned the same structure that we know from the table. As a consequence, the vectors collected in the nodes can easily be compared by some function (just wait a second for that). In the beginning, each node gets randomly initialized. Then the data are fed into the SOM.

This data feeding is organized as follows. A randomly chosen record is taken from the table and then compared to all of the nodes. There is always a best matching node. The record then gets inserted into this node. Upon this insertion, which is a kind of hiding, the values in the node’s structure vector are recalculated, e.g. as the (new) mean of all values across all records collected in a node (container). The trick now is to change not just the winning node into which the data record has been inserted, but also all nodes of the close surround, though with a strength that decreases with distance.

This small routine of searching the best matching node, insertion, and information spreading is done for all records, and possibly repeated. The spreading of information to the neighboring nodes is a crucial element of the SOM mechanism. This spreading is responsible for the self-organization. It also represents a probabilistic coupling in a network. Of course, there are some important variants of that, see below, but basically that’s all. Ok, there is some numerical bookkeeping, optimizations for searching the winning nodes etc., but these measures are not essential for the mechanism.
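
For the technically inclined reader, the whole basic mechanism condenses into a short loop. The following Java fragment is a sketch in the classic incremental form, where the weight vectors are adjusted directly; the variant described above, which keeps the records in the nodes and recalculates means, differs only in the update step. All parameter values are illustrative.

```java
import java.util.Random;

// Sketch of the basic SOM training loop in its classic incremental form: pick
// a random record, find the best matching node, then update the winner AND its
// grid neighbors with a strength that decreases with distance and time.
// All parameter values are illustrative.
public class SomTrainingSketch {

    static void train(double[][] nodes, int gridWidth, double[][] data, int steps) {
        Random rnd = new Random();
        for (int t = 0; t < steps; t++) {
            double[] rec = data[rnd.nextInt(data.length)];
            int winner = bestMatching(nodes, rec);
            double sigma = 3.0 * Math.exp(-t / (double) steps); // shrinking radius
            double rate  = 0.5 * Math.exp(-t / (double) steps); // decaying rate
            for (int n = 0; n < nodes.length; n++) {
                double d = gridDistance(winner, n, gridWidth);
                double h = Math.exp(-(d * d) / (2.0 * sigma * sigma)); // neighborhood
                for (int j = 0; j < rec.length; j++)
                    nodes[n][j] += rate * h * (rec[j] - nodes[n][j]);
            }
        }
    }

    /** Index of the node whose weight vector is most similar to the record. */
    static int bestMatching(double[][] nodes, double[] rec) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int n = 0; n < nodes.length; n++) {
            double s = 0.0;
            for (int j = 0; j < rec.length; j++)
                s += (nodes[n][j] - rec[j]) * (nodes[n][j] - rec[j]);
            if (s < bestD) { bestD = s; best = n; }
        }
        return best;
    }

    static double gridDistance(int a, int b, int w) {
        return Math.hypot(a % w - b % w, a / w - b / w);
    }
}
```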

As a result one will find similar records in the same node, or in its direct neighbors. It has been shown that the SOM is topology preserving; that is, the SOM is smooth with regard to the similarity of neighboring nodes. The data records inside a node form a list, which is described by the node’s value vector. That value vector could be said to represent a class, or intension, which is defined by its empirical observations, the cases, or extension.

After feeding all data to the SOM, the training is finished. For the SOM it is easy to run in a continuous mode, where the feed of incoming data does not “stop” at any point in time. Now the SOM can be used to classify new records. A new record simply needs to be compared to the nodes of the SOM, i.e. to the value vectors of the nodes, but NOT to all the cases (SOM is not case-based reasoning, but type-based reasoning!). If the records contained a marker attribute, e.g. indicating the quality of the record, you will also get the expected quality for a new record of unknown quality.

Properties of the SOM

The SOM belongs to the class of clustering algorithms. It is very robust against missing values in the table, and unlike many other methods it does NOT require any settings regarding the clusters, such as their size or number. Of course, this is a great advantage and a property of logical consistency. Nodes may remain empty, while the node value vector of the empty node is still well-defined. This is a very important feature of the SOM, as it represents the capability to infer potential yet unseen observations. No other method is able to behave like this. Other properties can be invoked by means of possible extensions of the basic mechanism (see below).

As already said, nodes collect similar records of data, where a record represents a single observation. It is important to understand that a node is not equal to a cluster. In our opinion, it does not make much sense to draw boundaries around one or several nodes and thus propose a particular “cluster.” Such a boundary should be set only upon an external purpose. Inversely, without such a purpose, it is sense-free to conceive of a trained SOM as a model. At best, it would represent a pre-specific model, and it is a great property of the SOM to be able to create such.

The learning is competitive, since different nodes compete for a single record. Yet, it is also cooperative, since upon an insert operation information is exchanged between neighboring nodes.

The reasoning of the SOM is type-based, which is much more favorable than case-based reasoning. It is also more flexible than ANN, which just provide a classification, with no distinction between extension and intension. The SOM, but not the ANN, can be used in two very different modes: either just for clustering or grouping individual observations without any further assumptions, or for targeted modeling, that is, for establishing a predictive/diagnostic link between several (or many) basic input variables and one (or several) target variable(s) that represent the outcome of a process according to experience. Such a double usage of the same structure is barely accessible for any other associative structure.

Another difference is that ANN are much more suitable to approximate single analytic functions, while SOM are suitable for general classification tasks, where the parameter space and/or the value (outcome) space could even be discontinuous or folded.

A large advantage over many other methods is that the similarity function and the cost function are explicitly accessible. For ANN, SVM or statistical learning this is not possible. Similarly, the SOM automatically adapts its structure to the data; i.e. it is even possible to change the method within the learning process, adaptively and in a self-organized manner.

As a result we can conclude that the design of the SOM method is much more transparent than that of any other of the advanced methods.

Competing Comparable Methods

SOM are either more robust, more general or simpler than any other method, while the quality of classification is closely comparable. Among the competing methods are artificial neural networks (ANN), principal component analysis (PCA), multi-dimensional scaling (MDS), or adaptive resonance theory networks (ART). Important ideas of ART networks can be merged with the SOM principle, keeping the benefits of both. PCA and MDS are based on statistical correlation analysis (covariance matrices), i.e. they import all the assumptions and necessary preconditions of statistics, namely the independence of the observational variables. Yet, it is the goal to identify such dependencies, thus it is not quite feasible to presuppose them! SOM do not suffer from such limitations arising from strange assumptions; besides, it has recently been proven that the SOM is just a generalization of PCA.

Of course, there are many other methods, like Support Vector Machines (SVM) with statistical kernels, or tree forests; yet, these methods are purely statistical in nature, with no structural part in them. Moreover, they do not provide access to the similarity function, as is possible for the SOM.

A last word about the particular difference between ANN and SOM. SOM are truly symmetrical networks, where each unit has its own explicit memory of observations, while the linkage to other units on the same level of integration is probabilistic. That means that the actual linkage between any two units can be changed dynamically within the learning process. In fact, a SOM is thus not a single network like a fisher net; it is much more appropriate to conceive of it as a representation of a manifold of networks.

Contrary to those advanced structural properties, the so-called Artificial Neural Networks are explicitly directional networks. Units represent individual neurons and do not have storage capacities. A unit does not know anything about things like observations. Conceptually, these units are thus on a much lower level than the units in a SOM. In ANN they cannot have “inner” structure. The same is true for the links between the units. Since they have to be programmed in an explicit manner (which is called “architecture”), the topology of the connections cannot be changed during learning, at the runtime of the program.

In ANN, information flows through the units in a directed manner (as in the case of natural neurons). It is almost impossible there to create an equally dense connectivity within a single layer of units as in the SOM. As a consequence, ANN do not show the capability for self-organization.

Taken as a whole, ANN seem to be under the representationalist delusion. In order to achieve the same general effects and abstract phenomena as the SOM is able to, very large ANN would be necessary. Hence, pure ANN are not really a valid alternative for our endeavor. This does not rule out the possibility to use them as components within a SOM or between SOMs.

Variants and Architectures

Here are some extensions and improvements of the SOM.

Homogenized Extensional Diversity
The original version of the SOM tends to collect “bad” records, i.e. those not matching well anywhere else, in a single cluster, even if the records are not similar at all. In this case it is not allowed to compare nodes any more, since the internal variance is no longer comparable, and the mean/variance on the level of the node would no longer describe the internal variance on the level of the collected records. The cure for this misbehavior is rather simple: the cost function controlling the matching of a record to the most similar node needs to include the variability within the set of records (the extension of the type represented by the node) collected by the node; see the sketch below. Additionally, merging and splitting of nodes as described for structural learning helps effectively. In the scientific literature there is as yet no reference for this extension of the SOM.
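
A sketch of such a variance-aware cost function (in Java; the weighting factor and all names are illustrative assumptions, since, as said, there is no published reference for this extension):

```java
import java.util.List;

// Sketch of a variance-aware matching cost: the distance to the node's weight
// vector is penalized by the variability the candidate record would introduce
// into the node's extension. LAMBDA and all names are illustrative assumptions.
public class VarianceAwareCostSketch {

    static final double LAMBDA = 0.5; // weight of the variance penalty

    static double cost(double[] weights, List<double[]> records, double[] candidate) {
        double d = 0.0;
        for (int j = 0; j < weights.length; j++)
            d += (weights[j] - candidate[j]) * (weights[j] - candidate[j]);
        return Math.sqrt(d) + LAMBDA * meanVariance(records, candidate);
    }

    /** Mean per-column variance of the extension if the candidate were added. */
    static double meanVariance(List<double[]> records, double[] candidate) {
        int dim = candidate.length, n = records.size() + 1;
        double total = 0.0;
        for (int j = 0; j < dim; j++) {
            double sum = candidate[j], sq = candidate[j] * candidate[j];
            for (double[] r : records) { sum += r[j]; sq += r[j] * r[j]; }
            double mean = sum / n;
            total += sq / n - mean * mean; // variance of column j
        }
        return total / dim;
    }
}
```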

Structural Learning
One of the most basic extensions of the original mechanism is to allow for splitting and merging of nodes according to some internal divergence criterion. A SOM made from such nodes is able to adapt structurally to the input data, not just statistically. This feature is inspired by the so-called ART networks [1]. Similarly, merging and splitting of the “nodes” of a SOM was proposed by [2], though not in the cue of ART networks.

Nested SOM
Since the SOM represents populations of neurons, it is easy and straightforward to think about nesting SOMs: each node would contain a smaller SOM. A node may even contain any other parametrized method, such as an Artificial Neural Network. The node value vector then would not exhibit the structure of the table, but instead would contain the parameters of the enclosed algorithm. One example of this is the so-called mn-SOM [3].

Growing SOM
Usually, data are not evenly distributed. Hence, some nodes grow much more than others. One way to cope with this situation is to let the SOM grow automatically. Many different variants of growth can be thought of, and some have already been implemented. Our experiments point in the direction of a “stem cell” analog.

Growing SOMs were first proposed by [4], while [5] provides some exploratory implementation. Concerning growing SOMs, it is very important to understand the concept (or phenomenon) of growth. We will discuss possible growth patterns and the consequences for possible new growing SOMs elsewhere. Just for now we can say that any kind of SOM structure can grow and/or differentiate.

SOM gas, mobile nodes
The name already says it: the topology of the grid making up the SOM is not fixed. Nodes may even range around quite freely, as in the case of the SOM gas.

Chemical SOM
Starting from mobile nodes, we can think about a small set of properties of nodes which are not directly given by the data structure. These properties can be interpreted as chemicals creating something like a Gray-Scott Reaction-Diffusion model, i.e. a self-organizing fluid dynamics. The possible effects are (i) a differential opacity of the SOM for transmitted information, (ii) the differentiation into fibers and networks, or (iii) the optimization of the topological structure as a standard part of the life cycle of the SOM. The mobility can be controlled internally by means of a “temperature,” or, expressed as a metaphor, the fixed SOM would melt partially. This may help to reorganize a SOM. In the scientific literature there is as yet no reference for this extension of the SOM.

Evolutionary Embedded SOM with Meta-Modeling
The SOM can be embedded into an evolutionary learning about the most appropriate selection of attributes. This can even be extended towards the construction of structural hypotheses about the data. While other methods could also be embedded in a similar manner, the results are drastically different, since most methods do not learn structurally. Coupling evolutionary processes with associative structures was proposed a long time ago by [6], albeit only in the context of the optimization of ANN. While this is quite reasonable, we additionally propose to use evolution in a different manner and for different purposes (see the chapter about evolution).

[1] ART networks
[2] merging splitting of nodes
[3] The mn-SOM
[4] Growing SOM a
[5] Growing SOM b
[6] evolutionary optimization of artificial neural networks

۞
