Beyond Containing: Associative Storage and Memory
February 14, 2012
Memory, our memory, is a wonderful thing. Most of the time.
Yet, it also can trap you, sometimes terribly, if you use it in inappropriate ways.
Think about the predicament of being a witness. As long as you don’t try to remember exactly, you know precisely. As soon as you try to achieve perfect recall, everything becomes fluid at first, then fuzzy and increasingly blurry, as if there were some kind of uncertainty principle, similar to Heisenberg’s. There are other tricks, such as asking a person the same question over and over again: any certainty, and hence knowledge, will vanish. In the other direction, everybody knows the experience of a tiny smell or sound triggering a whole story in memory, often one that has not been thought of for a long time.
The main strengths of memory—extensibility, adaptivity, contextuality and flexibility—could also be considered its main weaknesses, if we expect perfect reproducibility for the results of “queries”. Yet memory is not a database. There are neither symbols nor indexes, and at the deeper levels of its mechanisms not even signs. There is no particular neuron that “contains” information in the way a file on a computer can be regarded as providing it.
Databases are, of course, extremely useful, precisely because they cannot do anything other than reproduce answers perfectly. That’s how they are designed and constructed. And precisely for the same reason we may state that databases are dead entities, like crystals.
The reproducibility provided by databases expels time. We can write something into a database, stop everything, and continue precisely at the same point. Databases do not own their own time. Hence, they are purely physical entities. As a consequence, databases do not and cannot think. They can’t bring or put things together; they do not associate, superpose, or mix. Everything is under the control of an external entity. A database does not learn when the amount of bits stored inside it increases. We also have to be very clear about the fact that a database does not interpret anything. All this should not be understood as a criticism, of course; these properties are intended by design.
The first important consequence of this is that any system relying just on the principles of a database will inherit these properties as well. This raises the question of the necessary and sufficient conditions for “storage” devices that allow for learning and informational adaptivity.
As a first step one could argue that artificial systems capable of learning, for instance self-organizing maps or any other “learning algorithm”, may consist of a database and a processor. This would represent the bare bones of the classic von Neumann architecture.
The essence of this architecture is, again, reproducibility as a design intention. The processor is basically empty. As long as the database is not part of a self-referential arrangement, there won’t be anything like a morphological change.
Learning without a change of structure is not learning, but only a change in the values of structural parameters that have been defined a priori (at implementation time). The crucial step, however, would be to introduce those parameters in the first place. We will return to this point at a later stage of our discussion, when it comes to describing the processing capabilities of self-organizing maps.1
Of course, the boundaries are not well defined here. We may implement a system in a very abstract manner such that a change in the value of such highly abstract parameters indeed involves deep structural changes. In the end, almost everything can be expressed by some parameters and their values. That’s nothing else than the principle of the Deleuzean differential.
What we want to emphasize here is just the issue that (1) morphological changes are necessary in order to establish learning, and (2) these changes should be established in response to the environment (and the information flowing from there into the system). These two conditions together establish a third one, namely that (3) a historical contingency is established that acts as a constraint on the system’s further potential changes and responses. The system acquires individuality. Individuality and learning are co-extensive. Quite obviously, such a system is no longer a von Neumann device, even if it still runs on such a linear machine.
Our claim here is that learning requires a particular perspective on the concepts of “data” and “storage.” Correspondingly, without a changed concept of the relation between data and storage, the emergence of machine-based episteme will not be achievable.
Let us just contrast the two ends of our space.
- (1) At the logical end we have the von Neumann architecture, characterized by empty processors, perfect reproducibility on an atomic level, the “bit”; there is no morphological change; only estimation of predefined parameters can be achieved.
- (2) The opposite end is made from historically contingent structures for perception, transformation and association, where the morphology changes due to the interaction with the perceived information2; we will observe emergence of individuality; morphological structures are always just relative to the experienced influences; learning occurs and is structural learning.
With regard to a system that is able to learn, one possible conclusion from this would be to drop the distinction between the storage of encoded information and the treatment of those encodings. Perhaps it is the only viable conclusion to this end.
In the rest of this chapter we will demonstrate how the separation between data and their transformation can be overcome on the basis of self-organizing maps. Such a device we call “associative storage”. We also will find a particular relation between such an associative storage and modeling3. Notably, both tasks can be accomplished by self-organizing maps.
When taking the perspective of usage, there is yet another large contrast between databases and associative storage (“memories”). In the case of a database, the purpose of a storage event is known at the time the storing operation is performed. In the case of memories and associative storage, this purpose is not known, and often cannot reasonably be expected to be knowable in principle.
From that we can derive a quite important consequence. In order to build a memory, we have to avoid storing the items “as such,” as is the case for databases. We may call this the (naive) representational approach. Philosophically speaking, the stored items do not have any structure inside the storage device, neither an inner structure nor an outer one. Any item appears as a primitive quale.
The contrast to the process in an associative storage is indeed a strong one. Here it is simply forbidden to store items in an isolated manner, without relation to other items, as an engram, i.e. an encoded and reversibly decodable series of bits. Since a database works perfectly reversibly and reproducibly, we can encode the grapheme of a word into a series of bits and later decode that series back into a grapheme, which we as humans (with memory inside the skull) can in turn interpret as a word. Strictly taken, we do NOT use the database to store words.
More concretely, what we have to do with the items comprises two independent steps:
- (1) Items have to be stored as context.
- (2) Items have to be stored as probabilized items.
The second part of our re-organized approach to storage is a consequence of the impossibility of knowing about future uses of a stored item. Taken inversely, using a database for storage always and strictly implies that the storing agent claims to know perfectly about future uses. It is precisely this implication that renders long-lasting storage projects so problematic, if not impossible.
In other words, and more concisely, we may say that in order to build a dynamic and extensible memory we have to store items in a particular form.
Memory is built on the basis of a population of probabilistic contexts in and by an associative structure.
The Two-Layer SOM
In a highly interesting prototypical model project (codename “WEBSOM”), Kaski (a collaborator of Kohonen) introduced a particular SOM architecture that serves the requirements described above. Yet Kohonen (and all of his colleagues alike) has so far not recognized the actual status of that architecture. We already mentioned this point in the chapter about improvements of the SOM design: Kohonen fails to discern modeling from sorting when he uses the associative storage as a modeling device. Yet modeling requires a purpose, operationalized into one or more target criteria. Hence, an associative storage device like the two-layer SOM can be conceived as a pre-specific model only.
Nevertheless, this SOM architecture is not only highly remarkable, but we also can easily extend it appropriately; thus it is indeed so important, at least as a starting point, that we describe it briefly here.
Context and Basic Idea
The context for which the two-layer SOM (TL-SOM) has been created is document retrieval by classification of texts. From the perspective of classification, texts are highly complex entities. This complexity derives from the following properties:
- – there are different levels of context;
- – there are rich organizational constraints, e.g. grammars;
- – there is a large corpus of words;
- – there is a large number of relations that not only form a network, but which also change dynamically in the course of interpretation.
Taken together, these properties turn texts into ill-defined or even undefinable entities, for which it is not possible to provide a structural description, e.g. as a set of features, and particularly not in advance of the analysis. Briefly, texts are unstructured data. It is clear that non-contextual methods in particular, like the infamous n-grams, are deeply inappropriate for the description, and hence also for the modeling, of texts. The peculiarity of texts was recognized long before the age of computers. Around 1830 Friedrich Schleiermacher founded the discipline of hermeneutics as a response to the complexity of texts. In the last decades of the 20th century, it was Jacques Derrida who brought a new perspective to it. In Deleuzean terms, texts are always and inevitably deterritorialized to a significant degree. Kaski and coworkers addressed only a modest part of these vast problematics: the classification of texts.
Their starting point was to preserve context. The large variety of contexts makes it impossible to take any kind of raw data directly as input for the SOM. This means the contexts have to be encoded in a proper manner. The trick is to use a SOM for this encoding (details in the next section below). This SOM represents the first layer; its subject matter is the contexts of words (definition below). The “state” of this first SOM is then used to create the input for the SOM on the second layer, which in turn addresses the texts. In this way, the size of the input vectors is standardized and reduced.
Elements of a Two-Layer SOM
The elements, or building blocks, of a TL-SOM devised for the classification of texts are
- (1) random contexts,
- (2) the map of categories (word classes),
- (3) the map of texts.
The Random Context
A random context encodes the context of any of the words in a text. Let us assume, for the sake of simplicity, that the context is bilaterally symmetric according to 2n+1, i.e. for example with n=3 the length of the context is 7, where the focused word (“structure”) is at position 3 (when counting starts at 0).
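As a sketch, the extraction of such a symmetric window can be expressed in a few lines of Python; the function name and the word list are merely illustrative assumptions, not part of the WEBSOM implementation:

```python
# Minimal sketch: extract the symmetric 2n+1 context window around each
# occurrence of a focus word. Helper name and inputs are illustrative.

def context_windows(words, focus, n=3):
    """Yield the 2n+1 words centered on each occurrence of `focus`."""
    for i, w in enumerate(words):
        if w == focus and n <= i < len(words) - n:
            yield words[i - n : i + n + 1]

snippet = "without change of structure is not learning".split()
window = next(context_windows(snippet, "structure"))
print(window)
# ['without', 'change', 'of', 'structure', 'is', 'not', 'learning']
```

With n=3 and counting from 0, the focused word indeed sits at position 3 of the window.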
Let us resort to the following example, which takes just two snippets from this text. The numbers represent an arbitrary enumeration of the relative positions of the words.
|sequence A of words||“… without change of structure is not learning …”|
|rel. positions in text||53 54 55 56 57 58 59|
|sequence B of words||“… not have any structure inside the storage …”|
|rel. positions in text||19 20 21 22 23 24 25|
We need the position numbers only for calculating the positional distance between words. The interesting word here is “structure”.
For the next step you have to think of the words as listed in a catalog of indexes, that is, as a set whose order is arbitrary but fixed. In this way, each word gets its unique numerical fingerprint.
|1264||structure||0.270 0.938 0.417 0.299 0.991 …|
|1265||learning||0.330 0.990 0.827 0.828 0.445 …|
|1266||Alabama||0.375 0.725 0.435 0.025 0.915 …|
|1267||without||0.422 0.072 0.282 0.157 0.155 …|
|1268||storage||0.237 0.345 0.023 0.777 0.569 …|
|1269||not||0.706 0.881 0.603 0.673 0.473 …|
|1270||change||0.170 0.247 0.734 0.383 0.905 …|
|1271||have||0.735 0.472 0.661 0.539 0.275 …|
|1272||inside||0.230 0.772 0.973 0.242 0.224 …|
|1273||any||0.509 0.445 0.531 0.216 0.105 …|
|1274||of||0.834 0.502 0.481 0.971 0.711 …|
|1275||is||0.935 0.967 0.549 0.572 0.001 …|
Any word of a text can now be replaced by an a priori determined vector of random values from [0..1]; the dimensionality of those random vectors should be around 80 in order to approximate orthogonality among all those vectors. Just to be clear: these random vectors are taken from a fixed codebook, a catalog as sketched above, where each word is assigned exactly one such vector.
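Such a fixed codebook can be sketched as follows; the seed and the helper name are illustrative assumptions, while the dimensionality of 80 follows from the requirement of approximate orthogonality:

```python
import random

DIM = 80                 # ~80 dimensions approximate pairwise orthogonality
rng = random.Random(42)  # fixed seed: the codebook must stay stable
_codebook = {}

def fingerprint(word):
    """Return the word's fixed random vector from [0..1), creating it on first use."""
    if word not in _codebook:
        _codebook[word] = [rng.random() for _ in range(DIM)]
    return _codebook[word]

v1 = fingerprint("structure")
v2 = fingerprint("structure")
assert v1 is v2          # each word is assigned exactly one vector
assert len(fingerprint("storage")) == DIM
```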
Once we have performed this replacement, we can calculate the averaged vectors per relative position of the context. In the case of the example above, we would calculate the reference vector for position -3 as the average of the vectors encoding the words “without” and “not”.
Let us be more explicit. Example sequence A is first translated into positional numbers; each positional number is interpreted as a column header, and the column is filled with the values of the respective word’s fingerprint. For the 7 positions (-3 … +3) we get 7 columns:
|sequence A of words||“… without change of structure is not learning …”|
|rel. positions in text||53 54 55 56 57 58 59|
|grouped around “structure”||-3 -2 -1 0 1 2 3|
…further entries of the fingerprints…
The same has to be done for the second sequence, B. Now we have two tables of fingerprints, both comprising 7 columns and N rows, where N is the length of the fingerprint. From these two tables we calculate the average values and put them into a new table (which is of course also of dimension 7×N). Thus, the example above yields 7 averaged reference vectors. If the random vectors have a dimensionality of 80, we end up with a matrix of [r,c] = [80,7].
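The averaging step can be sketched like this; for readability we use a toy fingerprint dimension of 2 and windows of 3 positions instead of 80 and 7, and lay the tables out as [position][dimension] (the transpose of the 7×N layout above). All values are made up:

```python
# Element-wise mean over several context tables, each laid out as
# [position][dimension].

def average_tables(tables):
    """Average a list of [positions][dim] context tables element-wise."""
    n = len(tables)
    return [[sum(t[p][d] for t in tables) / n
             for d in range(len(tables[0][0]))]
            for p in range(len(tables[0]))]

table_a = [[0.2, 0.4], [0.6, 0.8], [0.0, 1.0]]   # toy window A
table_b = [[0.4, 0.2], [0.2, 0.0], [1.0, 0.0]]   # toy window B
avg = average_tables([table_a, table_b])
print([[round(x, 3) for x in row] for row in avg])
# [[0.3, 0.3], [0.4, 0.4], [0.5, 0.5]]
```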
In a final step we concatenate the columns into a single vector, yielding a vector of 7×80 = 560 variables. This might appear to be a large vector. Yet it is much smaller than the whole corpus of words in a text. Additionally, such vectors can be compressed by the technique of random projection (for the mathematical foundations and applications see the references below), which today is quite popular in data analysis. Random projection works by matrix multiplication: our vector (1R × 560C) gets multiplied by a matrix M(r) of 560R × 100C, yielding a vector of 1R × 100C. The matrix M(r) also consists of flat random values. This technique is very interesting because no relevant information is lost, while the vector gets shortened considerably. Of course, in an absolute sense there is a loss of information; yet the SOM only needs the information that is important to distinguish the observations.
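The random projection step itself is plain matrix multiplication. The following sketch uses the dimensions from the text (560 → 100), with a fixed random matrix of flat values; the seed, names, and the stand-in input vector are our own assumptions:

```python
import random

rng = random.Random(7)
IN_DIM, OUT_DIM = 560, 100
# Fixed random matrix M(r): 560 rows x 100 columns of flat values from [0..1)
M = [[rng.random() for _ in range(OUT_DIM)] for _ in range(IN_DIM)]

def project(v):
    """Multiply the 1x560 row vector v by M, yielding a 1x100 vector."""
    return [sum(v[i] * M[i][j] for i in range(IN_DIM)) for j in range(OUT_DIM)]

v = [rng.random() for _ in range(IN_DIM)]  # stand-in concatenated context vector
short = project(v)
assert len(short) == OUT_DIM
```

In practice one would use a library routine for this (often with zero-mean random entries), but the principle is just this product.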
This technique of transferring a sequence made from items encoded on a symbolic level into a vector based on random contexts can, of course, be applied to any symbolic sequence.
For instance, it would be a drastic case of reductionism to conceive of the path taken by humans in an urban environment as just a sequence of locations. Humans are symbolic beings, and the urban environment is full of symbols to which we respond. Yet, from the population-oriented perspective any individual path is just one possible path; naturally, we interpret it as a random path. The path taken through a city needs to be described both by location and by symbol.
The advantage of the SOM is that the random vectors encoding the symbolic aspect can be combined seamlessly with any other kind of information, e.g. the locational coordinates. That is the property of multi-modality. Which particular combination of “properties” is then suitable to classify the paths for a given question is subject to “standard” extended modeling as described in the chapter Technical Aspects of Modeling.
The Map of Categories (Word Classes)
From these random context vectors we can now build a SOM. Similar contexts will arrange themselves in adjacent regions.
A particular text can now be described by its differential abundance across that SOM. Remember that we have sent the random contexts of many texts (or text snippets) to the SOM. To achieve such a description, a (relative) frequency histogram is calculated, which has as many classes as the SOM has nodes. The value in each class of the histogram is the relative frequency (“probability”) of the presence of a particular text in comparison to all other texts.
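A minimal sketch of such a histogram, assuming the first-layer SOM has already returned a best-matching node index for each context of the text (the node indices below are made up; in reality they come from a trained map):

```python
from collections import Counter

def node_histogram(bmu_nodes, n_nodes):
    """Relative frequency of each SOM node across a text's contexts."""
    counts = Counter(bmu_nodes)
    total = len(bmu_nodes)
    return [counts[i] / total for i in range(n_nodes)]

# A toy text whose 8 contexts fall onto a 4-node map:
hist = node_histogram([0, 2, 2, 1, 2, 0, 3, 2], n_nodes=4)
print(hist)
# [0.25, 0.125, 0.5, 0.125]
```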
Any particular text is now described by a fingerprint that contains highly relevant information about
- – the context of all words as a probability measure;
- – the relative topological density of similar contextual embeddings;
- – the particularity of texts across all contextual descriptions, again as a probability measure;
Those fingerprints represent texts and they are ready-mades for the final step, “learning” the classes by the SOM on the second layer in order to identify groups of “similar” texts.
It is clear that this basic variant of the two-layer SOM procedure can be improved in multiple ways. Yet the idea should be clear. Some of those improvements are:
- – to use a fully developed concept of context, e.g. this one, instead of a constant length context and a context without inner structure;
- – evaluating not just the histogram as a foundation of the fingerprint of a text, but also the sequence of nodes according to the sequence of contexts; that sequence can be processed using a Markov-process method, such as HMM, Conditional Random Fields, or, in a self-similar approach, by applying the method of random contexts to the sequence of nodes;
- – reflecting at least parts of the “syntactical” structure of the text, such as sentences, paragraphs, and sections, as well as the grammatical role of words;
- – enriching the information about “words” by representing them not only in their observed form, but also as their close synonyms, or stuffed with information about pointers to semantically related words, as can be taken from labeled corpora.
We want to return briefly to the first layer. Just imagine not measuring the histogram, but instead following the indices of the contexts across the developed map with your fingertips. A particular path, or virtual movement, appears. I think it is crucial to reflect this virtual movement in the input data for the second layer.
The reward could be significant, indeed. It offers nothing less than a model for conceptual slippage, a term which has been emphasized by Douglas Hofstadter throughout his research on analogical and creative thinking. Note that in our modified TL-SOM this capacity is not an “extra function” that had to be programmed. It is deeply built “into” the system; in other words, it makes up its character. Besides Hofstadter’s proposal, which is based on a completely different approach and for a different task, we do not know of any other system that would be capable of that. We may even expect that the efficient production of metaphors can be achieved by it, which is no insignificant goal, since all practiced language is always metaphoric.
We already mentioned that the method of the TL-SOM extracts important pieces of information about a text and represents them as a probabilistic measure. The SOM does not contain the whole piece of text as a single entity, or as a series of otherwise unconnected entities, the words. The SOM breaks the text up into overlapping pieces, or better, into overlapping probabilistic descriptions of such pieces.
It would be a serious misunderstanding to perceive this splitting into pieces as a drawback or failure. It is the mandatory prerequisite for building an associative storage.
Any further target-oriented modeling would refer to the two layers of a TL-SOM, but never to the raw input text. Thus it can work reasonably fast for a whole range of different tasks. One of those tasks, which can be solved by a combination of associative storage and true (targeted) modeling, is to find an optimized model for a given text, or any text snippet, including the identification of the discriminating features. We can also turn the perspective around, addressing a query to the SOM about an alternative formulation in a given context…
From Associative Storage towards Memory
Despite its power and its potential as associative storage, the two-layer SOM still can’t be conceived as a memory device. The associative storage just takes the probabilistically described contexts and sorts them topologically into the map. In order to establish “memory”, further components are required that provide the goal orientation.
Within the world of self-organizing maps, simple (!) memories are easy to establish. We just have to combine a SOM that acts as associative storage with a SOM for targeted modeling. The peculiar distinctive feature of that second SOM for modeling is that it does not work on external data, but on “data” as it is available in and as the SOM that acts as associative storage.
We may establish a vivid memory in its full meaning if we add three further components: (1) targeted modeling via the SOM principle, (2) a repository of the targeted models that have been built from (or using) the associative storage, and (3) at least a partial operationalization of a self-reflective mechanism, i.e. a modeling process that models the working of the TL-SOM itself. Since in our framework the basic SOM module is able to grow and to differentiate, there is no principal limitation for such a system any more concerning its capability to build concepts, models, and (logical) habits for navigating between them. Later, we will call the “space” where this navigation takes place the “choreosteme”: drawing figures into the open space of epistemic conditionability.
From such a memory we may expect dramatic progress concerning the “intelligence” of machines. The only questionable thing is whether we should still call such an entity a machine. I guess there is neither a word nor a concept for it.
1. Self-organizing maps have some amazing properties on the level of their interpretation, which they share especially with Markov models. As such, the SOM and Markov models are outstanding: both can be conceived as devices that turn programming statements, i.e. all the IF-THEN-ELSE statements occurring in a program, into DATA. Even logic itself, or more precisely any quasi-logic, gets transformed into data. SOM and Markov models are double-articulated (a Deleuzean notion) into logic on the one side and the empirical on the other.
In order to achieve this, full write access to the extensional as well as the intensional layer of a model is necessary. Hence, neither artificial neural networks nor, of course, statistical methods like PCA can be used to achieve the same effect.
2. It is quite important not to forget that (in our framework) information is nothing that “is out there.” If we follow the primacy of interpretation, for which there are good reasons, we also have to acknowledge that information is not a substantial entity that could be stored or processed. Information is nothing else than the actual characteristics of the process of interpretation. These characteristics can’t be detached from the underlying process, because this process is represented by the whole system.
3. Keep in mind that we only can talk about modeling in a reasonable manner if there is an operationalization of the purpose, i.e. if we perform target oriented modeling.
-  Werner Heisenberg. Uncertainty Principle.
-  Samuel Kaski, Timo Honkela, Krista Lagus, Teuvo Kohonen (1998). WEBSOM – Self-organizing maps of document collections. Neurocomputing 21 (1998) 101-117.
-  W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189–206. Amer. Math. Soc., 1984.
-  R. Hecht-Nielsen. Context vectors: general purpose approximate meaning representations self-organized from raw data. In J.M. Zurada, R.J. Marks II, and C.J. Robinson, editors, Computational Intelligence: Imitating Life, pages 43–56. IEEE Press, 1994.
-  Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (1998). Latent semantic indexing: A probabilistic analysis. Proceedings of the Seventeenth ACM Symposium on the Principles of Database Systems (pp. 159-168). ACM press.
-  Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 245-250). ACM Press.