February 14, 2012
Memory, our memory, is a wonderful thing. Most of the time.
Yet, it also can trap you, sometimes terribly, if you use it in inappropriate ways.
Think about the problems of being a witness. As long as you don’t try to remember exactly, you know precisely. As soon as you try to achieve perfect recall, everything becomes fluid at first, then fuzzy and increasingly blurry, as if there were some kind of uncertainty principle, similar to Heisenberg’s [1]. There are other tricks, such as asking a person the same question over and over again: any degree of certainty, and hence of knowledge, will vanish. In the other direction, everybody knows the experience that a tiny smell or sound triggers a whole story in memory, often one that has not been attended to for a long time.
The main strengths of memory (extensibility, adaptivity, contextuality and flexibility) could also be considered its main weaknesses if we expect perfect reproducibility for the results of “queries”. Yet, memory is not a database. There are neither symbols nor indexes and, at the deeper levels of its mechanisms, no signs either. There is no particular neuron that would “contain” information in the way a file on a computer can be regarded as containing it.
Databases are, of course, extremely useful, precisely because they cannot do anything other than reproduce answers perfectly. That’s how they are designed and constructed. And for precisely the same reason we may state that databases are dead entities, like crystals.
The reproducibility provided by databases expels time. We can write something into a database, stop everything, and continue at precisely the same point. Databases do not own their own time. Hence, they are purely physical entities. As a consequence, databases do not, and cannot, think. They can’t bring or put things together; they do not associate, superpose, or mix. Everything is under the control of an external entity. A database does not learn when the number of bits stored inside it increases. We also have to be very clear about the fact that a database does not interpret anything. None of this should be understood as criticism, of course: these properties are intended by design.
The first important consequence of this is that any system relying just on the principles of a database will inherit these properties as well. This raises the question of the necessary and sufficient conditions for the foundations of “storage” devices that allow for learning and informational adaptivity.
As a first step one could argue that artificial systems capable of learning, for instance self-organizing maps or any other “learning algorithm”, may consist of a database and a processor. This would represent the bare bones of the classic von Neumann architecture.
The essence of this architecture is, again, reproducibility as a design intention. The processor is basically empty. As long as the database is not part of a self-referential arrangement, there won’t be something like a morphological change.
Learning without change of structure is not learning but only a change in the values of structural parameters that have been defined a priori (at implementation time). The crucial step, however, would be to introduce those parameters in the first place. We will return to this point at a later stage of our discussion, when we describe the processing capabilities of self-organizing maps (see note 1).
Of course, the boundaries are not well defined here. We may implement a system in a very abstract manner such that a change in the value of such highly abstract parameters indeed involves deep structural changes. In the end, almost everything can be expressed by some parameters and their values. That’s nothing else than the principle of the Deleuzean differential.
What we want to emphasize here is just that (1) morphological changes are necessary in order to establish learning, and (2) these changes should be established in response to the environment (and the information flowing from there into the system). These two conditions together establish a third one, namely that (3) a historical contingency is established that acts as a constraint on the further potential changes and responses of the system. The system acquires individuality. Individuality and learning are co-extensive. Quite obviously, such a system is not a von Neumann device any longer, even if it still runs on such a linear machine.
Our claim here is that “learning” requires a particular perspective on the concept of “data” and its “storage.” Correspondingly, without a changed concept of the relation between data and storage, the emergence of a machine-based episteme will not be achievable.
Let us just contrast the two ends of our space.
- (1) At the logical end we have the von Neumann architecture, characterized by empty processors, perfect reproducibility on an atomic level, the “bit”; there is no morphological change; only estimation of predefined parameters can be achieved.
- (2) The opposite end is made from historically contingent structures for perception, transformation and association, where the morphology changes due to the interaction with the perceived information (see note 2); we will observe the emergence of individuality; morphological structures are always just relative to the experienced influences; learning occurs and is structural learning.
With regard to a system that is able to learn, one possible conclusion from this would be to drop the distinction between the storage of encoded information and the treatment of those encodings. Perhaps it is the only viable conclusion.
In the rest of this chapter we will demonstrate how the separation between data and their transformation can be overcome on the basis of self-organizing maps. Such a device we call “associative storage”. We will also find a particular relation between such an associative storage and modeling (see note 3). Notably, both tasks can be accomplished by self-organizing maps.
From the perspective of usage there is yet another strong contrast between databases and associative storage (“memories”). In the case of a database, the purpose of a storage event is known at the time the storing operation is performed. In the case of memories and associative storage this purpose is not known, and often cannot reasonably be expected to be knowable even in principle.
From that we can derive a quite important consequence. In order to build a memory, we have to avoid storing the items “as such,” as is the case for databases. We may call this the (naive) representational approach. Philosophically, the stored items do not have any structure inside the storage device, neither an inner structure nor an outer one. Any item appears as a primitive quale.
The contrast to the process in an associative storage is indeed a strong one. Here, it is simply forbidden to store items in an isolated manner, without relation to other items, as an engram, that is, an encoded and reversibly decodable series of bits. Since a database works in a perfectly reversible and reproducible way, we can encode the grapheme of a word into a series of bits and later decode that series back into a grapheme again, which in turn we as humans (with memory inside the skull) can interpret as words. Strictly speaking, we do NOT use the database to store words.
More concretely, what we have to do with the items comprises two independent steps:
- (1) Items have to be stored as context.
- (2) Items have to be stored as probabilized items.
The second part of our re-organized approach to storage is a consequence of the impossibility to know about future uses of a stored item. Taken inversely, using a database for storage always and strictly implies that the storage agent claims to know perfectly about future uses. It is precisely this implication that renders long-lasting storage projects so problematic, if not impossible.
In other words, and even more concisely, we may say that in order to build a dynamic and extensible memory we have to store items in a particular form.
Memory is built on the basis of a population of probabilistic contexts in and by an associative structure.
The Two-Layer SOM
In a highly interesting prototypical model project (codename “WEBSOM”) Kaski (a collaborator of Kohonen) introduced a particular SOM architecture that meets the requirements described above [2]. Yet, Kohonen (and all of his colleagues alike) have so far not recognized the actual status of that architecture. We already mentioned this point in the chapter about some improvements of the SOM design; Kohonen fails to distinguish modeling from sorting when he uses the associative storage as a modeling device. Yet, modeling requires a purpose, operationalized into one or more target criteria. Hence, an associative storage device like the two-layer SOM can be conceived of as a pre-specific model only.
Nevertheless, this SOM architecture is not only highly remarkable, but we also can easily extend it appropriately; thus it is indeed so important, at least as a starting point, that we describe it briefly here.
Context and Basic Idea
The context for which the two-layer SOM (TL-SOM) has been created is document retrieval by classification of texts. From the perspective of classification, texts are highly complex entities. This complexity of texts derives from the following properties:
- there are different levels of context;
- there are rich organizational constraints, e.g. grammars;
- there is a large corpus of words;
- there is a large number of relations that not only form a network, but which also change dynamically in the course of interpretation.
Taken together, these properties turn texts into ill-defined or even undefinable entities, for which it is not possible to provide a structural description, e.g. as a set of features, and particularly not in advance of the analysis. Briefly, texts are unstructured data. It is clear that especially non-contextual methods like the infamous n-grams are deeply inappropriate for the description, and hence also for the modeling, of texts. The peculiarity of texts was recognized long before the age of computers. Around 1830 Friedrich Schleiermacher founded the discipline of hermeneutics as a response to the complexity of texts. In the last decades of the 20th century, it was Jacques Derrida who brought a new perspective to it. In Deleuzean terms, texts are always and inevitably deterritorialized to a significant degree. Kaski and coworkers addressed only a modest part of this vast set of problems, the classification of texts.
Their starting point was to preserve context. The large variety of contexts makes it impossible to take any kind of raw data directly as input for the SOM. That means that the contexts had to be encoded in a proper manner. The trick is to use a SOM for this encoding (details in the next section below). This SOM represents the first layer. The subject of this SOM is the contexts of words (definition below). The “state” of this first SOM is then used to create the input for the SOM on the second layer, which then addresses the texts. In this way, the input vectors are standardized and reduced in size.
Elements of a Two-Layer SOM
The elements, or building blocks, of a TL-SOM devised for the classification of texts are
- (1) random contexts,
- (2) the map of categories (word classes),
- (3) the map of texts.
The Random Context
A random context encodes the context of any of the words in a text. Let us assume for the sake of simplicity that the context is bilaterally symmetric with a length of 2n+1; for example, with n=3 the length of the context is 7, and the focused word (“structure”) is at position 3 (when counting starts with 0).
Let us resort to the following example, which takes just two snippets from this text. The numbers represent some arbitrary enumeration of the relative positions of the words.
|sequence A of words||“… without change of structure is not learning …”|
|rel. positions in text||53 54 55 56 57 58 59|
|sequence B of words||“… not have any structure inside the storage …”|
|rel. positions in text||19 20 21 22 23 24 25|
The position numbers we just need for calculating the positional distance between words. The interesting word here is “structure”.
For the next step you have to think about the words listed in a catalog of indexes, that is as a set whose order is arbitrary but fixed. In this way, any of the words gets its unique numerical fingerprint.
|1264||structure||0.270 0.938 0.417 0.299 0.991 …|
|1265||learning||0.330 0.990 0.827 0.828 0.445 …|
|1266||Alabama||0.375 0.725 0.435 0.025 0.915 …|
|1267||without||0.422 0.072 0.282 0.157 0.155 …|
|1268||storage||0.237 0.345 0.023 0.777 0.569 …|
|1269||not||0.706 0.881 0.603 0.673 0.473 …|
|1270||change||0.170 0.247 0.734 0.383 0.905 …|
|1271||have||0.735 0.472 0.661 0.539 0.275 …|
|1272||inside||0.230 0.772 0.973 0.242 0.224 …|
|1273||any||0.509 0.445 0.531 0.216 0.105 …|
|1274||of||0.834 0.502 0.481 0.971 0.711 …|
|1275||is||0.935 0.967 0.549 0.572 0.001 …|
Any of the words of a text can now be replaced by an a priori determined vector of random values from [0..1]; the dimensionality of those random vectors should be around 80 in order to approximate orthogonality among all those vectors. Just to be clear: these random vectors are taken from a fixed codebook, a catalog as sketched above, in which each word is assigned exactly one such vector.
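To make the idea of the fixed codebook a bit more tangible, here is a minimal sketch in Python. The function name `build_codebook`, the seed and the use of Python's `random` module are our own assumptions for illustration; they are not taken from the WEBSOM project.

```python
import random

def build_codebook(words, dim=80, seed=42):
    """Assign each distinct word exactly one fixed random vector.

    Values are drawn uniformly from [0, 1). The seed is fixed so that
    the same word always receives the same fingerprint (an assumption
    of this sketch, standing in for a stored catalog).
    """
    rng = random.Random(seed)
    codebook = {}
    # The order of the catalog is arbitrary but fixed; sorting is one way to fix it.
    for word in sorted(set(words)):
        codebook[word] = [rng.random() for _ in range(dim)]
    return codebook

words = "without change of structure is not learning".split()
codebook = build_codebook(words)
print(len(codebook), len(codebook["structure"]))
```

Once the codebook exists, replacing a word by its fingerprint is just a lookup, e.g. `codebook["structure"]`.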
Once we have performed this replacement, we can calculate the averaged vectors per relative position of the context. In the example above, we would calculate the reference vector for the first position of the context (index 0 when counting starts with 0, i.e. relative position -3) as the average of the vectors encoding the words “without” and “not”.
Let us be more explicit. For example sentence A, we first translate each word into its positional number relative to the focused word, interpret this positional number as a column header, and fill the column with the values of the word’s fingerprint. For the 7 positions (-3 to +3) we get 7 columns:
|sequence A of words||“… without change of structure is not learning …”|
|rel. positions in text||53 54 55 56 57 58 59|
|grouped around “structure”||-3 -2 -1 0 1 2 3|
…further entries of the fingerprints…
We have to do the same for the second sequence B. Now we have two tables of fingerprints, both comprising 7 columns and N rows, where N is the length of the fingerprint. From these two tables we calculate the element-wise average and put it into a new table (which of course also has 7 columns and N rows). Thus, the example above yields 7 averaged reference vectors. If we have a dimensionality of 80 for the random vectors we end up with a matrix of [r,c] = [80,7].
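The averaging per position can be sketched as follows. This is only an illustration under our own assumptions: the helper `averaged_context` and the seeded toy codebook are hypothetical, and the two windows are the example sequences A and B, both centred on “structure”.

```python
import random

def averaged_context(windows, codebook):
    """Average the fingerprints position by position.

    windows  : list of word windows, all of the same length (2n+1)
    codebook : dict mapping each word to its fixed random vector
    Returns one averaged vector (a column of length N) per position.
    """
    windows = list(windows)
    width = len(windows[0])
    dim = len(next(iter(codebook.values())))
    columns = []
    for pos in range(width):
        column = [0.0] * dim
        for window in windows:
            fingerprint = codebook[window[pos]]
            for k in range(dim):
                column[k] += fingerprint[k] / len(windows)
        columns.append(column)
    return columns

seq_a = "without change of structure is not learning".split()
seq_b = "not have any structure inside the storage".split()

# Toy codebook with 80-dimensional fingerprints (hypothetical values).
rng = random.Random(0)
codebook = {w: [rng.random() for _ in range(80)]
            for w in sorted(set(seq_a + seq_b))}

columns = averaged_context([seq_a, seq_b], codebook)
print(len(columns), len(columns[0]))  # 7 columns with 80 entries each
```

The first column is the average of the fingerprints of “without” and “not”, and the centre column simply reproduces the fingerprint of “structure”, since both windows share that word.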
In a final step we concatenate the columns into a single vector, yielding a vector of 7×80=560 variables. This might appear to be a large vector. Yet, it is much smaller than the whole corpus of words in a text. Additionally, such vectors can be compressed by the technique of random projection (mathematical foundations by [3], first proposed for data analysis by [4], utilized for SOMs later by [5] and [6]), which today is quite popular in data analysis. Random projection works by matrix multiplication. Our vector (1R x 560C) gets multiplied with a matrix M(r) of 560R x 100C, yielding a vector of 1R x 100C. The matrix M(r) also consists of flat random values. This technique is very interesting, because the vector gets shortened considerably while no relevant information is lost. Of course, in an absolute sense there is a loss of information. Yet, the SOM only needs the information that is important for distinguishing the observations.
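As a rough sketch of the concatenation and the subsequent random projection, consider the following plain Python code. The function names are hypothetical, and the matrix is filled with flat random values from [0, 1) as described in the text; many formulations of random projection use zero-mean entries instead, so please read this as an illustration of the matrix multiplication only.

```python
import random

def concatenate(columns):
    """Join the 7 averaged columns into one long vector (7 x 80 = 560)."""
    return [value for column in columns for value in column]

def random_matrix(n_rows, n_cols, seed=0):
    """Matrix M(r) of flat random values; the seed keeps it fixed."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]

def project(vector, matrix):
    """Multiply a (1 x R) vector with an (R x C) matrix, giving (1 x C)."""
    n_cols = len(matrix[0])
    return [sum(x * row[c] for x, row in zip(vector, matrix))
            for c in range(n_cols)]

# Stand-in for the averaged context table: 7 columns of 80 values.
rng = random.Random(1)
columns = [[rng.random() for _ in range(80)] for _ in range(7)]

long_vector = concatenate(columns)           # 560 values
m_r = random_matrix(560, 100)                # 560R x 100C
short_vector = project(long_vector, m_r)     # 100 values
print(len(long_vector), len(short_vector))
```

The same matrix M(r) has to be used for all context vectors, otherwise the shortened vectors would not be comparable to each other.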
This technique of transferring a sequence of items encoded on a symbolic level into a vector that is based on random contexts can of course be applied to any symbolic sequence.
For instance, it would be a drastic case of reductionism to conceive of the path taken by humans in an urban environment just as a sequence of locations. Humans are symbolic beings and the urban environment is full of symbols to which we respond. Yet, for the population-oriented perspective any individual path is just a possible path. Naturally, we interpret it as a random path. The path taken through a city needs to be described both by location and symbol.
The advantage of the SOM is that the random vectors that encode the symbolic aspect can be combined seamlessly with any other kind of information, e.g. the locational coordinates. That’s the property of multi-modality. Which particular combination of “properties” is suitable for classifying the paths for a given question is then subject to “standard” extended modeling as described in the chapter Technical Aspects of Modeling.
The Map of Categories (Word Classes)
From these random context vectors we can now build a SOM. Similar contexts will arrange in adjacent regions.
A particular text can now be described by its differential abundance across that SOM. Remember that we have sent the random contexts of many texts (or text snippets) to the SOM. To achieve such a description a (relative) frequency histogram is calculated, which has as many classes as the SOM has nodes. The values of the histogram are the relative frequencies (“probabilities”) for the presence of a particular text in comparison to all other texts.
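One plausible reading of this histogram can be sketched as follows: for a single text, we count how often each node of the first-layer SOM is the best match for one of the text's context vectors, and divide by the number of contexts. The function names, the squared Euclidean distance and the tiny example map are our own assumptions; a normalization across all texts, as hinted at above, would be an additional step that is not shown here.

```python
def best_matching_node(vector, nodes):
    """Index of the node whose profile is closest (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(nodes)), key=lambda i: dist2(vector, nodes[i]))

def text_fingerprint(context_vectors, nodes):
    """Relative frequency of each SOM node among the contexts of one text."""
    counts = [0] * len(nodes)
    for vector in context_vectors:
        counts[best_matching_node(vector, nodes)] += 1
    total = len(context_vectors)
    return [count / total for count in counts]

# A toy map with three nodes and a "text" with four context vectors.
nodes = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
contexts = [[0.1, 0.0], [0.9, 1.0], [0.8, 0.9], [0.1, 0.9]]
print(text_fingerprint(contexts, nodes))  # [0.25, 0.5, 0.25]
```

The resulting list has one entry per node, so all texts are described by vectors of the same length, regardless of how long the texts are.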
Any particular text is now described by a fingerprint that contains highly relevant information about
- the context of all words as a probability measure;
- the relative topological density of similar contextual embeddings;
- the particularity of texts across all contextual descriptions, again as a probability measure.
Those fingerprints represent texts and they are ready-mades for the final step, “learning” the classes by the SOM on the second layer in order to identify groups of “similar” texts.
It is clear that this basic variant of a Two-Layer SOM procedure can be improved in multiple ways. Yet, the idea should be clear. Some of those improvements are
- to use a fully developed concept of context, e.g. this one, instead of a constant length context and a context without inner structure;
- to evaluate not just the histogram as a foundation of the fingerprint of a text, but also the sequence of nodes according to the sequence of contexts; that sequence can be processed using a Markov-process method, such as HMM or Conditional Random Fields, or, in a self-similar approach, by applying the method of random contexts to the sequence of nodes;
- to reflect at least parts of the “syntactical” structure of the text, such as sentences, paragraphs, and sections, as well as the grammatical role of words;
- to enrich the information about “words” by representing them not only in their observed form, but also as their close synonyms, or by adding pointers to semantically related words, as can be taken from labeled corpora.
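To give a flavour of what evaluating the sequence of nodes could mean, here is a deliberately simple sketch. It estimates first-order transition frequencies between nodes, i.e. a plain Markov chain, which is much simpler than an HMM or Conditional Random Fields; the function name and the example sequence are hypothetical.

```python
from collections import Counter

def transition_frequencies(node_sequence):
    """Relative frequency of moving from node a to node b in one step."""
    pairs = Counter(zip(node_sequence, node_sequence[1:]))
    outgoing = Counter(node_sequence[:-1])
    return {(a, b): n / outgoing[a] for (a, b), n in pairs.items()}

# Indices of the best matching nodes for the successive contexts of a text.
path = [0, 1, 1, 2, 1, 1]
for (a, b), p in sorted(transition_frequencies(path).items()):
    print(a, "->", b, round(p, 3))
```

Such transition frequencies could be appended to the histogram as further variables of the fingerprint, so that the second layer sees not only where the contexts of a text fall on the map, but also how the text moves across it.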
We want to briefly return to the first layer. Just imagine not measuring the histogram, but instead following the indices of the contexts across the developed map with your fingertips. A particular path, or virtual movement, appears. I think that it is crucial to reflect this virtual movement in the input data for the second layer.
The reward could be significant, indeed. It offers nothing less than a model for conceptual slippage, a term which has been emphasized by Douglas Hofstadter throughout his research on analogical and creative thinking. Note that in our modified TL-SOM this capacity is not an “extra function” that had to be programmed. It is deeply built “into” the system, or in other words, it makes up its character. Besides Hofstadter’s proposal, which is based on a completely different approach and serves a different task, we do not know of any other system that would be capable of that. We may even expect that the efficient production of metaphors can be achieved by it, which is not an insignificant goal, since all practiced language is always metaphoric.
We already mentioned that the method of the TL-SOM extracts important pieces of information about a text and represents them as a probabilistic measure. The SOM does not contain the whole piece of text as a single entity, or as a series of otherwise unconnected entities, the words. The SOM breaks the text up into overlapping pieces, or better, into overlapping probabilistic descriptions of such pieces.
It would be a serious misunderstanding to perceive this splitting into pieces as a drawback or failure. It is the mandatory prerequisite for building an associative storage.
Any further target-oriented modeling would refer to the two layers of a TL-SOM, but never to the raw input text. Thus it can work reasonably fast for a whole range of different tasks. One of those tasks that can be solved by a combination of associative storage and true (targeted) modeling is to find an optimized model for a given text, or any text snippet, including the identification of the discriminating features. We can also turn the perspective around, addressing a query to the SOM about an alternative formulation in a given context…
From Associative Storage towards Memory
Despite its power and its potential as associative storage, the Two-Layer SOM still can’t be conceived of as a memory device. The associative storage just takes the probabilistically described contexts and sorts them topologically into the map. In order to establish “memory”, further components are required that provide the goal orientation.
Within the world of self-organizing maps, simple (!) memories are easy to establish. We just have to combine a SOM that acts as associative storage with a SOM for targeted modeling. The peculiar distinctive feature of that second SOM for modeling is that it does not work on external data, but on “data” as it is available in and as the SOM that acts as associative storage.
We may establish a vivid memory in its full meaning if we establish the following components: (1) targeted modeling via the SOM principle, (2) a repository of the targeted models that have been built from (or using) the associative storage, and (3) at least a partial operationalization of a self-reflective mechanism, i.e. a modeling process that is going to model the working of the TL-SOM. Since in our framework the basic SOM module is able to grow and to differentiate, there is no longer any limitation in principle for such a system concerning its capability to build concepts, models, and (logical) habits for navigating between them. Later, we will call the “space” where this navigation takes place the “choreosteme”: drawing figures into the open space of epistemic conditionability.
From such a memory we may expect dramatic progress concerning the “intelligence” of machines. The only questionable thing is whether we should still call such an entity a machine. I guess there is neither a word nor a concept for it.
1. Self-organizing maps have some amazing properties on the level of their interpretation, which they share especially with Markov models. In this respect, the SOM and Markov models are outstanding. Both the SOM and the Markov model can be conceived of as devices that can be used to turn programming statements, i.e. all the IF-THEN-ELSE statements occurring in a program, into DATA. Even logic itself, or more precisely, any quasi-logic, gets transformed into data. SOM and Markov models are double-articulated (a Deleuzean notion) into logic on the one side and the empirical on the other.
In order to achieve this, full write access to the extensional as well as the intensional layer of a model is necessary. Hence, neither artificial neural networks nor, of course, statistical methods like PCA can be used to achieve the same effect.
2. It is quite important not to forget that (in our framework) information is nothing that “is out there.” If we follow the primacy of interpretation, for which there are good reasons, we also have to acknowledge that information is not a substantial entity that could be stored or processed. Information is nothing else than the actual characteristics of the process of interpretation. These characteristics can’t be detached from the underlying process, because this process is represented by the whole system.
3. Keep in mind that we only can talk about modeling in a reasonable manner if there is an operationalization of the purpose, i.e. if we perform target oriented modeling.
- [1] Werner Heisenberg. Uncertainty Principle.
- [2] Samuel Kaski, Timo Honkela, Krista Lagus, Teuvo Kohonen (1998). WEBSOM – Self-organizing maps of document collections. Neurocomputing 21 (1998) 101-117.
- [3] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability, volume 26 of Contemporary Mathematics, pages 189–206. Amer. Math. Soc., 1984.
- [4] R. Hecht-Nielsen. Context vectors: general purpose approximate meaning representations self-organized from raw data. In J.M. Zurada, R.J. Marks II, and C.J. Robinson, editors, Computational Intelligence: Imitating Life, pages 43–56. IEEE Press, 1994.
- [5] Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (1998). Latent semantic indexing: A probabilistic analysis. Proceedings of the Seventeenth ACM Symposium on the Principles of Database Systems (pp. 159-168). ACM Press.
- [6] Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 245-250). ACM Press.
February 4, 2012
It is the duality of persistent, quasi-material yet simulated structures and the highly dynamic, volatile and, most saliently, informational aspects that is so characteristic of learning entities like Self-Organizing Maps (SOM) or Artificial Neural Networks (ANN). It should not be regarded as a surprise that the design of the manifold aspects of the persistent, quasi-material part of a SOM or ANN is quite influential and hence also important.
Here we explore some of the aspects of that design. Sure, there is something like a “classic” version of the SOM, named after its inventor, the so-called “Kohonen-SOM.” Kohonen developed several slightly different SOM mechanisms over many years, starting with statistical covariance matrices. All of them comprise great ideas, for sure. Yet, in a wider perspective it is clear that there are many properties of the SOM that are presumably quite sub-optimal for realizing a generally applicable learning mechanism.
The Elements of SOMs
We shall recapitulate the principle of the SOM very briefly below. More detailed descriptions can be found in many places on the Web (one of the best for the newbie, with some formulas and demo software: ai-junkie); see also our document here that relates some issues to references, as well as our intro in plain language.
Yet, the question beyond all the mathematical formula stuff is: “What are the elements of a SOM?”
We propose to distinguish the following four basic elements:
- (1) a Collection of Items that have memory for observations, or reflect them, where all the items start with the same structure for these observations (items are often called “nodes”, or in a more romantic attitude “neurons”);
- (2) the Spatial Layout Principles and the relational arrangement of these items;
- (3) an Influence Mechanism that links the items together, and which together with the spatial layout defines the topology of the piece;
- (4) a Perceptional Mechanism that introduces observations into the SOM in a particular manner.
In the case of the SOM these elements are configured in a way that creates a particular class of “learning” that we can describe as competitive-collaborative abstraction.
Those basic elements of a SOM can be parameterized, and thus also implemented, in very different ways. If we took only the headlines of that list we could also subsume artificial neural networks (ANN) under these elements. Yet, even the items of a SOM and those of an ANN are drastically different. Likewise, the meanings of concepts like “layout” or “influence mechanism” are very different. This results in a completely different architecture regarding the relation between the “data”, or if you like potential observations, and the structure (SOM or ANN). Basically, ANNs are analytic, which means that the abstraction is done (has to be done) before the interaction of the structure with the data. In strong contrast to this approach, SOMs build up an abstraction while interacting with the data. This abstraction consists mostly of the transition from extensional data to intensional representation. Thus SOMs are able to find a structure, while ANNs can only move within the a priori defined structure. In contrast to ANNs, SOMs are associative mechanisms (which is the reason why we are so fond of them).
Yet, it is also true for SOMs that the parametrization of the instances of the four elements listed above has a significant influence on the capabilities and the potential of the resulting actual associative structure. Note that the design of the internals of the SOM does not refer to the issues of the usage or the embedding of the SOM into a wider context of modeling, or the structure of modeling itself.
In the following we will discuss the usual actualizations of those four elements, the respective drawbacks and better alternatives.
The SOM itself
Often one can find schematic representations like the one shown in the following figure 1:
Then this is usually described in this way: “The network is created from a 2D lattice of ‘nodes’, each of which is fully connected to the input layer.”
Albeit this is a possible description, it is a highly misleading one, with some quite unfavorable consequences: as we will see, it hides some important opportunities offered by the SOM mechanism.
Instead of speaking in an opaque manner about the “input layer” we simply can use the concept of “structured observations”. The structure is just given by the features used to establish or describe the observations. The important step that simplifies everything is to give all the nodes the same structure as the observations, at least in the beginning and as the trivial case; we will see that both assumptions may “develop away” as an effect of self-organization.
Anyway, the complicated connectivity in figure 1 changes into the following structure for the simple case:
Figure 2: An interpretation of the SOM grid, where the nodes are stuffed with the same structure (ordered set of variables) as the observations. This interpretation allows for a localizing of structures that is not achievable by the standard interpretation as shown in Fig.1.
To see what we gain by this change we have to visit briefly and partially the SOM mechanism.
The SOM mechanism compares a particular “incoming” observation to “all” nodes and determines a best matching node. The intensional part of this node then gets changed as a function of the given weight vector and the new observation. Some kind of intermediate between the observational vector and the intensional vector of the node is established. As a consequence, the nodes develop different intensional descriptions. This change upon matching with an observation then will be spread in the vicinity of the selected node, decaying with the distance, while this distance additionally is shrinking with increasing duration of the learning process. This is called the lateral control mechanism (LCM) by Kohonen (see Kohonen’s book 2001 p.179). This LCM is one of the most striking differences to so-called artificial neural networks (ANN).
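The mechanism just described can be condensed into a single training step. The following sketch makes several assumptions of its own that are not prescribed by the text: a rectangular grid stored as a dictionary, a Gaussian decay of the influence with grid distance, and a linear shrinking of both the learning rate and the radius over time. It is meant as an illustration, not as Kohonen's exact formulation.

```python
import math

def train_step(grid, observation, step, total_steps,
               rate0=0.5, radius0=2.0):
    """One SOM update. grid maps (row, col) to a profile vector (list)."""
    # 1. Compare the observation to all nodes, pick the best matching node.
    bmu = min(grid, key=lambda pos: sum((w - x) ** 2
              for w, x in zip(grid[pos], observation)))
    # 2. Learning rate and neighbourhood radius shrink as learning proceeds.
    remaining = 1.0 - step / total_steps
    rate = rate0 * remaining
    radius = max(radius0 * remaining, 1e-9)
    # 3. Spread the change into the vicinity, decaying with grid distance.
    for pos, profile in grid.items():
        d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        influence = math.exp(-d2 / (2 * radius ** 2))
        for k, x in enumerate(observation):
            # Move the profile towards the observation: an intermediate
            # between the old profile and the new observation.
            profile[k] += rate * influence * (x - profile[k])
    return bmu

grid = {(r, c): [0.0, 0.0] for r in range(3) for c in range(3)}
grid[(1, 1)] = [0.9, 0.9]
winner = train_step(grid, [1.0, 1.0], step=0, total_steps=10)
print(winner, grid[(1, 1)], grid[(0, 1)], grid[(0, 0)])
```

In this small example the centre node wins, and its direct neighbours are pulled more strongly towards the observation than the corner nodes.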
It is now rather straightforward to think that the node keeps the index of the matching observation in its local memory. Over the course of learning, a node collects many records, which are all similar. This gathering of observations into an explicit collection is one of the MOST salient differences of our interpretation of the SOM to most of the standard interpretations!
Figure 3: As Fig.2, showing the extensional container of one of the nodes.
The consequences are highly significant: The SOM is not a tool for visualization any more, it is a mechanism with inherent and nevertheless transparent abstraction! To be explicit: While we retain the full power of the SOM mechanism, we not only get an explicit clustering, but also the opportunity for fully validated modeling, including a full description of the structure of the risk of misclassification. Hence there is no “black box” any more (in contrast to, say, ANN, or even statistical methods).
Now we can see what we gained from changing the description, dropping the unholy concept of “input layer.” It now becomes clearly visible that nodes can be conceived of as containers, comprising an extensional and an intensional part (as Carnap used the terms). The intensional part is what usually is called the weight vector of a node. The extensional part is the list of observations matching this intension.
The intensional part of a node thus represents a type. The extensional part of our revised SOM node represents the matching tokens.
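To make the type/token distinction tangible, here is a small sketch of such a node. The class and method names (`Node`, `add`, `refresh_profile`) are hypothetical, and deriving the profile as the plain mean of the members is just the simplest conceivable case, not a prescription.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Intensional part: one value per variable, describing the "type".
    profile: list
    # Extensional part: indexes of the observations (the "tokens") matched by this node.
    members: list = field(default_factory=list)

    def add(self, index):
        """Remember the index of an observation that matched this node."""
        self.members.append(index)

    def refresh_profile(self, observations):
        """Recompute the profile from the members (plain mean, an assumption of this sketch)."""
        if not self.members:
            return
        n = len(self.members)
        self.profile = [sum(observations[i][k] for i in self.members) / n
                        for k in range(len(self.profile))]

observations = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
node = Node(profile=[0.0, 0.0])
node.add(0)
node.add(1)
node.refresh_profile(observations)   # the profile now describes observations 0 and 1 only
```

Because the node keeps the indexes and not just the aggregated profile, one can always go back from the type to the tokens it stands for.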
But wait! As is usually done, we called the intensional part of the node the “weight vector”. Yet, this is a drastic misnomer. It does not consist of “weights” of the variables. Each position simply holds a value that can be calculated in different ways, and which is influenced from different sides. It is a function of
- – the underlying extensional part = the list of records;
- – the similarity functional that is used for this node;
- – the general network dynamics;
- – any kind of dynamic rule relating the new observation.
It is thus much more adequate to talk about an “intensionality profile” than about weights. Of course, we can additionally introduce real “weights” for each of the positions in a structure profile vector.
A second important advantage of dropping this bad concept of “input layer” is that we can localize the function that results in the actualization of the intensional part of the node. For instance, we can localize the similarity function. As part of the similarity function we could even consider implementing a dynamic rule (dependent on the extensional content of the node) that excludes certain positions (= variables) as arguments from the determination of the similarity!
The third important consequence is that we created a completely new compartment, the “extensional container” of a node. Using the concept of “input layer” this compartment is simply not visible. Thus, the concept of the input layer violates central insights from the theory of epistemic action.
This “extensional container” is not just a list of records. We can conceive of it as a “functional” compartment that allows for a great deal of new flexibility and dynamics. These inner dynamics could be used to create new elements of the intensional part of the node, e.g. about the variance of the tokens contained in the “extensionality container”, or about their relation as measured by the correlation. In fact, we could use any mechanism to create new positions in the intensional profile of a node: even the properties of an embedded SOM, a small population of artificial neurons, the result parameters of statistical functions taking the list of observations as input, and so on.
It is quite important to understand that the particular dynamics in the extensionality container is purely local. Notably, the possibility of such dynamics also makes it possible to implement local differentiation of the SOM network, just as it is induced by the observations themselves.
There is even a fourth implication of dropping the concept of input layer, which led us to the separation between intensional and extensional aspects. This implication concerns the numerical production of the intensionality profile. Obviously we can regard the transition from the extensional description to the intensional representation as an abstraction. This abstraction, like any other, is accompanied by a loss of information. Referring to the collection of intensional representations means using them as a model. It is now very important to recognize that there is no explicit down-stream connection to the observations any more. All we have at our disposal are intensional representations that emerged as a consequence of the interaction of three components: (1) the observations, (2) the quasi-material aspects of the modeling procedure (particularly the associative part of it, of course), and (3) the imposed target/risk settings.
As a consequence we have to care explicitly about the variance structure within the extensional containers. More precisely, the internal variances of the extensional containers have to be “comparable.” If we did not care about that, we could not consider the intensional representations as comparable. We would simply compare apples with oranges, since some of the intensional representations would simply represent “a large mess”. On the level of the intensionality profile one can’t see the variance anymore, hence we have to avoid the establishment of extensional groups (“micro-clusters”) that do not collect observations that are “similar” with regard to their descriptional values vector (inside the a priori given space of assignates). Astonishingly, this requirement of a homogenized extensional variance measure is overlooked even by Kohonen and his group, not to mention the implementations by countless epigonal fellows. It is clear that only the explicit distinction between the intensional and the extensional part of a model makes this important structural element visible.
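What such a check could look like is sketched below. The particular variance measure (mean per-feature population variance) and the criterion (a multiple of the median across all nodes) are merely our assumptions for illustration, as are the names `container_variance`, `incomparable_nodes`, and `max_ratio`; any other homogeneity criterion would serve the same role.

```python
from statistics import median, pvariance

def container_variance(members):
    """Mean per-feature population variance of the observations in one container."""
    if len(members) < 2:
        return 0.0
    columns = list(zip(*members))
    return sum(pvariance(col) for col in columns) / len(columns)

def incomparable_nodes(containers, max_ratio=3.0):
    """Ids of nodes whose internal variance exceeds max_ratio times the median variance."""
    variances = {node_id: container_variance(m) for node_id, m in containers.items()}
    threshold = max_ratio * median(variances.values())
    return [node_id for node_id, v in variances.items() if v > threshold]

containers = {
    "a": [[1.0, 1.0], [1.2, 0.8]],   # tight micro-cluster
    "b": [[2.0, 2.0], [2.2, 1.8]],   # tight micro-cluster
    "c": [[0.0, 0.0], [5.0, 5.0]],   # "a large mess"
}
messy = incomparable_nodes(containers)   # only node "c" is flagged
```

A node flagged in this way would be a candidate for splitting, or for being excluded when the intensional profiles are used as a model.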
Finally, and as a fifth consequence, we would like to emphasize that the explicit distinction between intensional and extensional parts opens the road towards a highly interesting region. We already mentioned that the transition from extensional description to intensional representation is a kind of abstraction. Yet, it is a simple kind of abstraction, closely tied to quasi-material aspects of the associative mechanism.
We may, however, easily derive the production of idealistic representations from that, if not even “ideas” in the philosophical sense. To achieve that we just have to extend the SOM with a production facility, the capability to simulate. This is of course not a difficult task. We will describe the details elsewhere (essay is scheduled), thus just a brief outline here. The “trick” is to use the intensional representations as seeds for generating surrogate observations by means of a Monte-Carlo simulation, such that the variance of the surrogate observations is a bit smaller than that of the empirical observations. Both the empirical and the surrogate “data” (nothing is “given” in the latter case) share the same space of assignates. The variance threshold can be derived dynamically from the SOM itself; it need not be predetermined at implementation time. As the next step one drops the extensional containers of the SOM and feeds the simulated data into it. After several loops of such self-referential modeling the intensional descriptions have “lost” their close ties to empirical data, yet they are not completely unrelated. We still may use them as a kind of “template” in modeling, or for instance as a kind of null-model. In other words, the SOM contains the first traces of Platonic ideas.
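Since the details are deferred to another essay, the following is only a rough sketch of the sampling step under our own assumptions: independent Gaussian noise per feature, a fixed `shrink` factor below 1 to obtain the slightly smaller variance, and hypothetical function names (`empirical_spreads`, `surrogate_observations`). The actual procedure may well differ.

```python
import random
import statistics

def empirical_spreads(members):
    """Per-feature standard deviation of the observations in one extensional container."""
    return [statistics.pstdev(col) for col in zip(*members)]

def surrogate_observations(profile, spreads, n, shrink=0.8, rng=None):
    """Draw n surrogate observations around an intensional profile.

    shrink < 1 makes the surrogate spread a bit smaller than the empirical one.
    Independent Gaussian noise per feature is an assumption of this sketch.
    """
    rng = rng or random.Random()
    return [[rng.gauss(mu, shrink * sd) for mu, sd in zip(profile, spreads)]
            for _ in range(n)]

# Seed: the profile of one node plus the spread measured in its container.
members = [[0.4, 0.15], [0.6, 0.25], [0.5, 0.20]]
spreads = empirical_spreads(members)
surrogates = surrogate_observations([0.5, 0.2], spreads, n=100, rng=random.Random(7))
```

Feeding `surrogates` from all nodes back into a SOM whose containers have been emptied would then constitute one loop of the self-referential modeling outlined above.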
Modeling. What else?
Above we emphasized that the SOM provides the opportunity for a fully validated modeling if we distinguish explicitly intensional and extensional parts in the make-up of the nodes. The SOM is, however, a strange thing, that can act in completely different ways.
In the chapter about modeling we concluded that a model without a purpose is not a model, or it is at most a strongly deficient model. Nevertheless, many people claim to create models without implying a purpose to the learning SOM. They call it “unsupervised clustering”. This is, of course, nonsense. It should be called more appropriately, “clustering with a deliberately hidden purpose,” since all the parameters of the SOM mechanisms and even the implementation act as constraints for the clustering, too. Any clustering mechanism applies a lot of criteria that influence the results. These constraints are supervised by the software, and the software has been produced by a human being (often called programmer), so this human being is supervising the clustering with a long arm. For the same reason one can not say the SOM is learning something and also not that we would train the SOM, without giving it a purpose.
If the digesting of information by a SOM without a purpose being present is neither modeling nor learning, how then should we conceive of such a process?
The answer is pretty simple, and remember it becomes visible only after having dropped illegitimate ascriptions of mistaken concepts. This clustering has a particular epistemological role:
Self-organizing Maps that are running without purpose (i.e. target variables) are best described as associative storage devices. Nothing more, but above all, also nothing less.
Actually, this has to be rated as one of the greatest currently unrecognized opportunities in the field of machine learning. The reason is again inadequate wording. Of course, the input for such a map should be probabilized (randomized), and it has been already demonstrated how to accomplish this… guess by whom… by Teuvo Kohonen himself, while he was inventing the so-called WebSom. Kohonen proposed random neighborhoods for presenting snippets of texts to the SOM, which are a simple version of random contexts.
Importantly, once one recognizes the categorical differences between the target oriented modeling and the associative storage, it becomes immediately clear that there are strictly different methodological, hence quasi-morphological requirements. Astonishingly, even Kohonen himself, and any of his fellows as well, did not recognize the conceptual difference between the two flavors. He used SOMs created without target variable, i.e. without implying a purpose, as models for performing selections. Note that the principal mechanism of the SOM is the same for both approaches. There are just differences in the cost function(s) regarding the selection of variables.
There should be no doubt that any system intended to advance towards an autonomous machine-based episteme has to combine the two mechanisms. There are still other mechanisms, such as virtual movements, or virtual sequences in the abstract SOM space (we will describe that elsewhere), or the self-referential SOM for developing “crisp ideas”, but such a combination of associative storage and target oriented modeling is definitely inevitable (in our perspective… but we have strong arguments!).
SOM and Self-Organization
A small remark should be made here: Self-organizing maps are not in the same strong sense self-organizing as for instance Turing systems, or other Reaction-Diffusion Systems (RDS). A SOM gets organized by the interaction of its mechanisms and structures and the data. A SOM does not create patterns by it-SELF. Without feeding data into it, nothing happens, in stark contrast to self-organizing systems in the strong sense (see the example we already cited here), or take a look here from where we reproduced this parameter map for Gray-Scott Models.
Figure 4: The parameter map for Gray-Scott models, a particular Reaction-Diffusion System. Only for certain combinations of the two parameters of the system interesting patterns appear, and only for part of them the system remains dynamical, i.e. changing the layout of the patterns continuously.
As we discuss it in the chapter on complexity, it is pretty clear which kind of conditions must be at work to create the phenomenon of self-organization. None of them is present in Self-Organizing Maps; above all, SOMs are neither dissipative, nor are there antagonist influences.
Yet, it is not too difficult to create a self-organizing map that is really self-organizing. What is needed is either a second underlying process or inhibitory elements organized as a population. In natural brains, we find both kinds of processes. The key to choosing the right starting point for implementing a system that shows the transition from SOM to RDS is the complete probabilization of the idea of the network.
Our feeling is that at least one of them is mandatory in order to allow the system to develop logic as a category in an autonomous manner, i.e. not pre-programmed. Like any other understanding, the ability to think in logical terms, or to use logic as a category, should not be programmed into a computer. That ability should emerge from the implemented conditions. Our claim that some concept is quite the opposite of something else is quite likely based on such processes. It is highly indicative in this context that the brain is indeed showing Turing patterns on the level of activity patterns, i.e. the patterns are not made of material entities, but are completely immaterial. Moreover, as in chemical clocks such as the Belousov-Zhabotinsky system, another RDS, the natural brain shows a strong rhythmicity, both in its “local” activity patterns and in the overall activity, affecting billions of cells at a time.
So far, the strong self-organization is not implemented in our FluidSOM.
Spatial Layout Principles
The spatial layout principle is a very important design aspect. It concerns not only the actual geometrical arrangement of nodes, but also their mobility as representations of physical entities. In the case of the SOM this has to be taken quite abstractly. The “physical entities” represented by the nodes are not neurons. The nodes represent functional roles of populations of neurons.
Usually, the SOM is defined as a collection of nodes that are arranged in a particular topology. This topology may be
- – grid like, 2-(3) dimensional;
- – as kind of a swarm in 2 dimensions;
- – as a gas, freely moving nodes.
The obvious difference between them is the degree of physical freedom for the nodes to move around. In grids, nodes are fixed and cannot move, while in the SOM gas the “nodes” are much more mobile.
There is also a quite important, yet not so obvious commonality between them. Firstly, in all of these layout principles the logical SOM nodes are identical with the “physical” items, i.e. representations of crossings in a grid, swarming entities, or gaseous containers. Thus, the data aspect of the nodes is not cleanly separated from their spatial behavior. If we separate the two, the behavior of the nodes and the spatial aspects can be handled more transparently, i.e. the relevant parameters are more accessible.
Secondly, the space where those nodes are embedded is conceived as being completely neutral, as if those nodes would be arranged in deep space. Yet, everything we know of learning entities points to their mediality. In other words, the space that embeds the nodes should not be “empty”.
Using a Grid
In most cases the SOM is defined as a collection of nodes that are arranged as a regular grid (4(8)n, 6n). Think of it as a fixed network like a regular wire fence, or the atomic bonds in a model of a crystal.
This layout is by far the most abundant one, yet it is the most restricted one. It is almost impossible, or at least very difficult, to make such a SOM dynamic, e.g. to provide it with the potential to grow or to differentiate.
The advantage of grids is that it is quite easy to calculate the geometrical distance between the nodes, which is a necessary step to determine the influence between any two nodes. If the nodes are mobile, this measurement requires much more effort in terms of implementation.
Using Metaphors for Mobility: Swarms, or Gases
Here, the nodes may range freely. Their movement is strongly influenced (or even restricted) by the moves of their neighbors. Experience tells us that flocks of birds, schools of fish, or colonies of bacteria do not learn efficiently on the level of the swarm. Structures are destroyed too easily. The same is true for the gas metaphor.
Flexible Phase in a Mediating Space
Our proposal is to render the “phase” flexible according to the requirements that are important in a particular stage of learning. The nodes may be strictly arranged like in a crystal, or quite mobile, they may move around according to physical forces or according to their informational properties like the gathered data.
Ideally, the crystalline phases and the fluid phases depend on just two or three parameters. One example of this is the “repulsive field”, a collection of items in a 2D space which repel each other. If the kinetic energy of those items is not too large, and the range of the repellent force is not too low, this automatically leads to a hexagonal pattern. Yet, the pattern is not programmed as an a priori pattern. It is a result of properties of the items (and the embedding space). Thus, the emergent arrangement is never affected by something like a “layout defect.”
Inserting a new item or removing one is very easy in such a structure. More importantly, the overall characteristics of the system do not change despite the fact that the actual pattern changes.
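A single relaxation step of such a repulsive field might look as follows. This is a toy sketch under our own assumptions (overdamped motion without kinetic energy, a force that falls off linearly to zero at `force_range`, a border-free unit square); the function name and all parameters are hypothetical. We do not claim that this particular parameterization settles into the hexagonal pattern; whether it does depends on the density of items and the range of the force, as stated above.

```python
import math

def repel_step(points, force_range=0.3, step=0.05, size=1.0):
    """Move every item away from all neighbors closer than force_range (torus of side `size`)."""
    moved = []
    for i, (xi, yi) in enumerate(points):
        fx = fy = 0.0
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            # Shortest displacement from j to i on the torus.
            dx = (xi - xj + size / 2) % size - size / 2
            dy = (yi - yj + size / 2) % size - size / 2
            d = math.hypot(dx, dy)
            if 0.0 < d < force_range:
                strength = (force_range - d) / force_range   # linear falloff, an assumption
                fx += strength * dx / d
                fy += strength * dy / d
        moved.append(((xi + step * fx) % size, (yi + step * fy) % size))
    return moved

# Two items that start too close to each other drift apart.
items = [(0.50, 0.50), (0.55, 0.50)]
items = repel_step(items)
```

Inserting or removing an item is just a matter of appending to or deleting from the list; the next steps let the neighbors rearrange themselves, without any grid to repair.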
The Collection of Items: “Nodes”
In the classic SOM, nodes serve a double purpose:
- P1 – They serve as containers for references that point to records of data (= observations);
- P2 – They present this extensional list in an integrated, “intensional” form.
The intensional form of the list is simply the weight vector of that node. In the course of learning, the list of the records contained in a particular node will be selected such that they are increasingly similar.
Note that keeping the references to the data records is extremely important. It is NOT part of most SOM implementations. If we did not do it, we could not use the SOM as a modeling tool at all. This might be the reason why most people use the SOM just as a visualization tool for data (which is a dramatic misunderstanding).
The nodes are not “directly” linked. Whether they influence each other or not is dependent on the distance between them and the neighborhood function. The neighborhood function determines the neighborhood, and it is a salient property of the SOM mechanism that this function changes over time. Important for our understanding of machine-based epistemology is that the relations between nodes in a SOM are potentially of a probabilistic character.
However, if we use a fixed grid, a fixed distance function, and a deterministically behaving neighborhood function, the resulting relations are not probabilistic any more.
Moreover, in the case of the default SOM, the nodes are passive. They do not even perform the calculation of the weight vector, which is performed by a central “update” loop in most implementations. In other words, in a standard SOM a node is a data structure. Here we arrive at a main point in our critique of the SOM:
The common concept of a SOM is equivalent to a structural constant.
What we need, however, is something completely different. Even on the level of the nodes we need entities that can change their structure and their relationality.
The concept of FluidSOM must be based on active nodes.
These active nodes are semi-autonomous. They calculate the weight vector themselves, based either on new input data or on some other “chemical” influences. They may develop a few long-range outgoing fibers or masses of more or less stable (but not “fixed”!) input relations to other nodes. The active meta-nodes in a fluid self-organizing map may develop a nested mini-SOM, or may incorporate any other mechanism for evaluating the data to which they are pointing, e.g. a small neural network of a fixed structure (see mn-SOM). Meta-nodes also may branch out further SOM instances locally into relative “3D”, e.g. dependent on their work load, or again, on some “chemical influences”.
We see that meta-nodes are dynamic structures, something like a category of categories. This flexibility is indispensable for growth and differentiation.
This introduces the seed of autonomy on the lowest possible level. Here, within the almost material processes, it is barely autonomy; it is really a mechanical activity. Yet, this activity is NOT triggered by some reason any more. It is just there, as a property of the matter itself.
We are convinced that the top-level behavioral autonomy is (at least for large parts) an emergent property that grows out of the a-reasonable activity on the micro-material level.
The profile vector of a SOM node usually contains for all mutable variables (non-ID/TV) the average of the values in the extensional list. That is, the profile vector itself does not know anything about TV or index variable… which is solely the business of the Node.
In our case, however, and based on the principle of “strict locality,” the weight vector also may contain a further section, which refers to dynamic properties of the node, or of the data. We introduced this in a different way above when discussing the extensionality container of SOM nodes. One example is the deviation of the data in the node from a model function (such as a correlation). Such internal measurements cannot be predefined, and they are also not stable input data since they are constantly changing (due to the list of data in the node, the state of other nodes etc.).
This introduces the possibility of self-referentiality on the lowest possible level. Similar to the case of autonomy, we find the seed for self-referentiality on the topmost level (call it consciousness…) in the midst of the material layer.
If there is one lesson we can draw from the studies of naturally occurring brains, then it is the fact that there is no master code between neurons, no “Mentalese.” The brain does not work on the basis of its own language. Equivalently, there are no logical circuits implementing logic calculus. As a correlate we can say that the brain is not a thing that consists of a definite wiring. A brain is not a finite state automaton; it does not make any sense to ascribe states to brains. Instead, everything going on in a brain is probabilistic, even on the sub-cellular level. It is not determined in a definite manner how many vesicles have to burst in a synaptic gap to cause a transmission of the signal, nor is it determined how many neurons exactly make up a working group for a particular “function”, and so on. The only thing we can say is that certain fibers lead from certain “regions”, typically millions of neurons, to other such regions.
Note that any software program IS representable by just such a definite wiring. Hence, what we need is a mechanism that can transcend its own being as mechanism. We already discussed this issue in another chapter, where we identified abstract growth as a possible route to that achievement.
The processing of information in the brain is probabilistic, despite the fact that on the top level it “feels” different for us. Now, when starting to program artificial associative structures that are able to do similar things as a brain can accomplish, we have to respect this principle of probabilization.
We not only have to avoid hard-coded wiring between procedures. We have to avoid any explicit wiring at all. In terms of software architecture this translates into the proposal that we should not rely just on object-oriented programming (OOP). For instance, we would represent nodes in a SOM as objects, and the properties of these objects again would be other objects. OOP is an important, but certainly not a sufficient design element for a machine that shall develop its own episteme.
What we have to actualize in our implementation is not just OOP, but a messaging-based architecture, where all elements are only loosely coupled. The Lateral Control Mechanism (LCM) of the Kohonen SOM is a nice example of this; the explicit wiring in ANN is a perfect counter-example, a DON’T DO IT. Yet, as we will see in the next section, the LCM should not be considered as a symmetric and structurally constant functional entity!
Concerning programming style, on an even lower level this translates into the heavy use of so-called interfaces, as they are so prevalent in Java. It is not objects that are wired or passed around, but only interfaces. Interfaces are forward contracts about the standards for the interaction of different parts, which can actually change while the “program” is running.
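The difference between wiring and messaging can be illustrated with a tiny sketch. We use Python here instead of Java, with `typing.Protocol` playing the role of the interface; the names `Receiver`, `MessageBus`, `ActiveNode`, the topic string, and the payload keys are all hypothetical and not taken from any actual FluidSOM code.

```python
from collections import defaultdict
from typing import Protocol

class Receiver(Protocol):
    """The only contract between parts: whoever can receive a message may take part."""
    def receive(self, topic: str, payload: dict) -> None: ...

class MessageBus:
    """Loose coupling: senders know topics, never the concrete receivers."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, receiver: Receiver) -> None:
        self._subscribers[topic].append(receiver)

    def publish(self, topic: str, payload: dict) -> None:
        # Copy the list so that receivers may subscribe or leave while running.
        for receiver in list(self._subscribers[topic]):
            receiver.receive(topic, payload)

class ActiveNode:
    """A node that updates its own profile instead of being updated by a central loop."""
    def __init__(self, profile):
        self.profile = list(profile)

    def receive(self, topic: str, payload: dict) -> None:
        if topic == "observation":
            rate = payload["influence"]
            self.profile = [p + rate * (x - p)
                            for p, x in zip(self.profile, payload["values"])]

bus = MessageBus()
node = ActiveNode([0.0, 0.0])
bus.subscribe("observation", node)
bus.publish("observation", {"values": [1.0, 2.0], "influence": 0.5})
```

Nothing in the bus refers to `ActiveNode`; any other kind of receiver, say an embedded mini-SOM, could subscribe to the same topic at run time without the sender noticing.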
Of course, these considerations concern only the lowest, indeed material levels of an associative system, yet they are necessary. If we start with wires of any kind, we won’t achieve our goals. From the philosophical perspective it does not come as a surprise that the immanence of autonomous abstraction is to be found only in open processes, which include the dimension of mediality. Even in the interaction of its tiniest parts the system should not rely on definite encodings.
During their development, natural systems differentiate in their parts. Bodies are comprised of organs, organs are made of different cell types, and within all members of a cell type a further differentiation of their actual and context-specific role may occur. The same can be observed in social insects, or any other group of social beings. They are morphologically almost identical, yet their experience lets them do their tasks differentially, or even lets them do different tasks. Why then should we assume that all neurons in a large compound should act absolutely equally?
To illustrate the point we should visit a particular African termite species (Schedorhinotermes lamanianus) on which I worked as a young biologist. They feed on rotting wood. Since these pieces of wood are much larger than the termites, a problem occurs. The animals have to organize their collective foraging, i.e. where to stay and gnaw on the wood, and where to travel to return the harvested pieces to the home nest, where they then put them into a processing chamber stuffed with a special kind of fungus. The termites then actually feed on that fungus, and mostly not on the wood (though they also have bacteria in their gut to do the job of digesting the cellulose and the lignin).
Important for us is the foraging process. To organize gnawing sites and traveling routes they use pheromones, and no wonder, they use just two for that, which build a Turing system, as I proved with a small bio-test together with a colleague.
In the nervous system of animals we find a similar problem. The brain is not just a large network, over and over symmetric like a crystal. Of course not. There are compartments (see our chapter about complexity), there are fibers. The various parts of the brain even differ strongly with respect to their topological structure, their “wiring”. Why the heck should an artificial system look like a perfect crystal? In a crystal there will be no stable emergence, hence no structural learning. By the way, we should not expect structural learning in swarms either, for a very similar reason, albeit that reason instantiates in the opposite manner: complete perturbation prevents the emergence of compartments, too, hence no structural learning will be observed. (That’s the reason why we do not have swarms in the skull…)
Back to our neurons. We reject the approach of a direct representational simulation of neurons, or parts of the brain. Instead we propose to focus on the principles as elements of construction. Any system that is intended to show structural learning is in urgent need of the basic differentiation into “local” and “tele” (among others). Here we even meet a structural parallelism to large urban compounds.
We can implement the emergence of such fibers in a straightforward manner, if we make it dependent on the occurrence of repeated co-excitation of regions. This implies that we have to soften the SOM principle of the “winner-takes-all” approach. At least in large networks, any given observation should possibly leave its trace in different regions. Yet, our experience with very large maps indicates that this may happen almost inevitably. We used very simple observations consisting of only 3 features (r, g, and b, thus forming the RGB color triplet) and a large SOM, consisting of around 1’000’000 nodes. The topology was 4n, and the map was placed on a torus (no borders). After approx. 200’000 observations, the uniqueness of color concepts started to erode. For some colors, two conceptual regions appeared.
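For readers unfamiliar with the layout of that experiment, the border-free 4n topology can be sketched as follows. The function names are hypothetical, and taking the grid distance as the sum of the wrapped row and column offsets is our assumption about what fits a 4n neighborhood; the sketch covers the topology only, not the learning itself.

```python
def torus_neighbors_4n(row, col, rows, cols):
    """The four direct neighbors of a node; indexes wrap around, so there are no borders."""
    return [((row - 1) % rows, col), ((row + 1) % rows, col),
            (row, (col - 1) % cols), (row, (col + 1) % cols)]

def torus_grid_distance(a, b, rows, cols):
    """Number of 4n steps between two nodes, taking the shorter way around the torus."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)

# On a 1000 x 1000 map the corner node (0, 0) is a close neighbor of (999, 999).
corner_neighbors = torus_neighbors_4n(0, 0, 1000, 1000)
steps = torus_grid_distance((0, 0), (999, 999), 1000, 1000)
```

On such a map every node has exactly four neighbors, so no region is privileged by sitting at an edge; two distant regions for the same color therefore cannot be explained away as a border effect.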
In the further development of such SOMs, it is then quite natural to let fibers grow between such regions, changing the topology of the SOM from that of a crystal to that of a brain. While the first is almost perfectly isotropic in exactly 3 dimensions, the topology of the brain is (due to the functional differentiation into tele-fibres) highly anisotropic in a high and variable dimensionality.
Here we discussed some basic design issues about self-organizing maps and introduced some improvements. We have seen that wording matters when it comes to representing even a mechanism. The issues we touched upon have been
- – explicit distinction of intensionality and extensionality in the conceptualization of the SOM mechanism, leading to a whole “new” domain of SOM architectures;
- – producing idealistic representations from a collection of extensional descriptions;
- – dynamics in the extensionality domain, including embedding of other structures, thus proceeding to the principle of compartmentalization, functional differentiation and morphological growth;
- – the distinction between modeling and associative storage, which require different morphological structures once they are distinguished;
- – stuffing the SOM with self-organization in the strong sense;
- – spatial layout, fixed grid versus the emergent patterns in a repulsion field of freely moving particles; distinguishing material particles from functional abstract nodes;
- – nodes as active components of the grid;
- – self-referentiality on the microscopic level that gives rise to emergent self-referentiality on the macroscopic level;
- – programming style, which should not only be as abstract (and thus as general) as possible, but also has to proceed from a strictly defined, strongly coupled object-oriented style to a loosely coupled system based on messaging, even on the lowest levels of implementation, e.g. the interaction of nodes;
- – functional differentiation of nodes, leading to dynamic, fractional dimensionality and topological anisotropy.
Yet, there are still many more aspects that have to be considered if one tries to approach processes on a machinic substrate that could give rise to what we call “thinking.” In discussing the design issues listed above, we remained quite on the material level. But of course, morphology is important. Nevertheless we should not conceive of morphology as a perfect instance of a blueprint; it is more about the potential, if not to say the “virtuality”, that is implied as immanence by the morphology. Beyond that morphology, we have to design the processes of dynamic change of that morphology, which we usually call growth, or tissue differentiation. Even on top of that, we have to think about the informational, i.e. immaterial processes, that only eventually lead to morphological correlates.
Anyway, when thinking about machine-based episteme, we obviously have to forget about crystals and swarms, about perfection and symmetry in morphological structures. Instead, all of these issues, whether material or immaterial, should be designed with the perspective towards an immanence of virtuality in mind, based on probabilized mechanisms.
In a further chapter (scheduled) we will try to approach in more detail two other design issues about the implementation of an advanced Self-organizing Map that we already mentioned briefly here, again oriented at basic abstract elements and the principles found in natural brains: inhibitory processes and probabilistic negation on the one hand and the chemical milieu on the other. Above we already indicated that we expect a continuum between Self-organizing Maps and Reaction-Diffusion Systems, which in our perspective is highly significant for the working of brains, whether natural or artificial ones.