The Text Machine

July 10, 2012

What is the role of texts? How do we use them (as humans)?

How do we access them (as reading humans)? The answers to such questions seem pretty obvious. Almost everybody can read. Well, today. It is noteworthy, though, that reading itself, as a performance and regarding its use, changed dramatically at least twice in history: first after the invention of the vocal alphabet in ancient Greece, and a second time after book printing became abundant during the 16th century. Maybe the issue of reading isn't as simple as it seems in everyday life.

Beyond such accounts of historical issues and basic experiences, we have a lot of more theoretical results concerning texts. They begin with Friedrich Schleiermacher, who around 1830 was the first to identify hermeneutics as a subject and who formulated it in a way that has been considered more complete and powerful than the version proposed by Gadamer in the 1950s. They proceed, of course, with Wittgenstein (language games, rule following), Austin (speech act theory) and Quine (criticizing empiricism). Philosophers like John Searle, Hilary Putnam and Robert Brandom then explicated and extended the work of these former heroes, accompanied by many others. If you wonder why linguistics is missing here: it is because linguistics does not provide theories about language. Today, that domain is largely captured by positivism and the corresponding analytic approach.

Here in this little piece we pose these questions in the context of certain relations between machines and texts. There are a lot of such relations, some quite sophisticated or surprising. For instance, texts can be considered a kind of machine. Yet they bear a certain note of (virtual) agency as well, which makes this machine aspect of texts considerably non-trivial. We will not deal with that perspective here. Instead, we will just take a look at the possibilities and the respective practices of handling, or “treating,” texts with machines. Or, if you prefer, the treating of texts by machines, insofar as a certain autonomy of machines could be considered necessary to deal with texts at all.

Today we can find a fast-growing community of computer programmers who deal with texts as a kind of unstructured information. One of the buzzwords is the so-called “semantic web”, another is “sentiment analysis”. We won't comment in any detail on those movements, because they are deeply flawed. The first tries to formalize semantics and meaning apriori, attempting to render the world into a trivial machine. We have repeatedly criticized this, and herein we agree with Douglas Hofstadter (see this discussion of his “Fluid Analogy”). The second tries to identify the sentiment of a text or a “tweet”, e.g. about a stock or an organization, on the basis of statistical measures of keywords and their utterly naive “n-grammed” versions, without paying any notice to the problem of “understanding”. Such nonsense would not be as widespread if programmers read even a few fundamental philosophical texts about language. In fact they don't, and thus they are condemned to revisit the underdeveloped positions that arose centuries ago.

If we neglect the social role of texts for a moment, we might identify a single major role of texts, albeit one we then have to describe in rather general terms. We may say that the role of a text, as a specimen among many other texts from a large population, is to function as a medium for the externalization of mental content, serving the ultimate purpose of making possible the (re)construction of resembling mental content on the side of the interpreting person.

This interpretation has primacy. It is not possible to attach meaning to a text like a sticky note and then put the text, yellow sticky note included, directly into the recipient's brain. That may sound silly, but unfortunately it is the “theory” followed by many people working in the computer sciences. Interpretation can't be controlled completely, though, not even by the mind performing it, not even by the same mind that seconds before externalized the text through writing or speaking.

Now, the notion of mental content may seem both quite vague and hopelessly general. Yet in the previous chapter we introduced a structure, the choreostemic space, which allows us to speak fairly precisely about mental content. Note that we don't need to talk about semantics, meaning or references to “objects” here. Mental content is not a “state” either. Thinking “state” and the mental together is much on the same level as seriously considering the existence of sea monsters at the end of the 18th century, when the list science of Linnaeus had not yet been reshaped by the coming historical turn in the philosophy of nature. Nowadays we must consider it silly-minded to think about a complex story like the brain and its mind in terms of “state”. Doing so, one confounds the stability of the graphical representation of a word in a language with the complexity of a multi-layered dynamic process, spanned between deliberate randomness, self-organized rhythmicity and temporary, thus preliminary, meta-stability.

The notion of mental content does not refer to the representation of referenced “objects”. We do not have maps, lists or libraries in our heads. Everything we experience as inner life builds up from an enormous randomness through deep stacks of complex emergent processes, where each emergent level is also shaped top-down, implicitly and, except for the last one usually called “consciousness”, also explicitly. The stability of memory and words, of feelings and faculties, is deceptive; they are not so stable at all. Only their externalized symbolic representations are more or less stable, and even that stability as words etc. can be shattered easily. The point we would like to emphasize here is that everything that happens in the mind is constructed on the fly, while the construction is completed only with the ultimate step of externalization, that is, speaking or writing. The notion of “mental content” is thus a bit misleading.

The mental may be conceived most appropriately as a manifold of stacked and intertwined processes. This holds for the naturalist perspective as well as for the abstract perspective, as we have argued in the previous chapter. It is simply impossible to find a single stable point within the (abstract) dynamics between model, concept, mediality and virtuality, which could be thought of as spanning a space. We called it the choreostemic space.

For the following remarks about the relation between texts and machines, and about the practitioners engaged in building machines to handle texts, we have to keep in mind just those two things: (i) there is a primacy of interpretation, and (ii) the mental is a non-representative dynamic process that can't be formalized (in the sense of “being represented” by a formula).

In turn this means that we should avoid referring to formulas when setting out to build a “text machine”. Text machines will be helpful only if their understanding of texts, even if it is a rudimentary understanding, follows the same abstract principles as our human understanding of texts does. Machines pretending to deal with texts, but actually only moving dead formal symbols back and forth, as is the case in statistical text mining, n-gram based methods and similar approaches, are not helpful at all. The only thing that happens is that these machines introduce a formalistic structure into our human life. We may say that these techniques render humans helpful to machines.

Nowadays we can find a whole techno-scientific community engaged in the field of machine learning devoted to “textual data”. The computers are programmed in such a way that they can be used to classify texts. The idea is to provide some keywords, or anti-words, or even a small set of sample texts, which are then taken by the software as a kind of template used to build a selection model. This model is then used to select resembling texts from a large set of texts. We have to be very clear about the purpose of these software programs: they classify texts.

The input data for doing so is taken from the texts themselves. More precisely, the texts are preprocessed according to specialized methods, and each text gets described by a possibly large set of “features” extracted by these methods. The obvious point is that the procedure is purely empirical in the strong sense: only the available observations (the texts) are taken to infer the “similarity” between texts. Usually not even linguistic properties are used to form the empirical observations, albeit there are exceptions. People use the so-called n-gram approach, which is little more than counting letters: a zero-knowledge model about the series of symbols that humans interpret as text. Additionally, the frequency or relative positions of keywords and anti-words are usually measured and expressed by mostly quite simple statistical methods.
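To make concrete how little such a zero-knowledge description involves, here is a minimal sketch (plain Python, all names hypothetical) of character n-gram counting; note that “similarity” here is nothing but an overlap of symbol counts, entirely without understanding.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams: a zero-knowledge description
    that treats the text as a mere series of symbols."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Two texts count as "similar" if their n-gram profiles overlap,
# regardless of whether anything has been understood.
a = char_ngrams("the chair stands in the corner")
b = char_ngrams("the chairman stands for the corner office")
print(sum((a & b).values()))   # size of the shared n-gram mass
```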

Well, classifying texts is something quite different from understanding texts. Of course. Yet said community tries to reproduce the “classification” achieved or produced by humans. Thus, the engineers in the field of machine learning directed to texts implicitly claim a kind of understanding. They even organize competitions.

The problems with the statistical approach are quite obvious. Quine called it the dogma of empiricism and coined the Gavagai anecdote about it, a situation that even provides much more information than the text alone does. In order to understand a text we need references to many things outside the particular text(s) at hand. Two of those are especially salient: concepts and the social dimension. Directly opposite to the belief of positivists, concepts can't be defined in advance of a particular interpretation. Using catalogs of references does not help much if these catalogs are used just as lists of references. The software does not understand “chair” by the “definition” stored in a database, or even by the set of such references. It simply does not care whether the encoded ASCII codes yield the symbol “chair” or the symbol “h&e%43”. Douglas Hofstadter has been stressing this point over and over again, and we fully agree with it.

From that necessity of a particular and rather wide “background” (a notion of Searle's) derives the second problem, which is much more serious, even devastating to the soundness of the whole empirico-statistical approach. The problem is simple: even we humans have to read a text before being able to understand it. Only upon understanding could we classify it. Of course, the brains of many people are sufficiently trained to work on the relations of the text and any of its components while reading it. The basic setup of the problem, however, remains the same.

Actually, what happens is a constantly repeated re-reading of the text, taking into account all available insights regarding the text and its relations to the author and the reader, while this re-reading often takes place in memory. To perform this demanding task in parallel, based on the “cache” available from memory, requires a lot of experience and training, though. Less experienced people indeed re-read the text physically.

The consequence of all this is that we cannot determine the best empirical discriminators for a particular text while reading it, in order to select it as if we were applying a model. Actually, we can't determine the set of discriminators before we have read it all, at least not before the first pass. Let us call this the completeness issue.

The very first insight is thus that a one-shot approach to text classification is based on a misconception. The software and the human would have to align to each other in some kind of conversation. Otherwise it can't be specified, even in principle, what the task is, that is, which texts should actually be selected. Any approach to text classification not following this “conversation scheme” is necessarily bare nonsense. Yet that's not really a surprise (except for some of the engineers).

There is a further consequence of the completeness issue. We can't set up a table to learn from at all. This too is not a surprise, since setting up a table means setting up a particular symbolization. Any symbolization apriori to understanding must count as a hypothesis. As simple as that. Whether it matches our purpose or not we can't know before we have understood the text.

However, in order to make the software learn something we need assignates (traditionally called “properties”) and some criteria to distinguish better models from less performant ones. In other words, we need a recurrent scheme on the technical level as well.

That's why it is not perfectly correct to call texts “unstructured data”. (Besides the fact that data are not “out there”: we always need a measurement device, which in turn implies some kind of model AND some kind of theory.) In the case of texts, imposing a structure onto a text simply means understanding it. We could even say that a text as text is not structurable at all, since the interpretation of a text can never be regarded as finished.

All together, we may summarize the issue of the complexity of texts as deriving from the following properties:

  • – there are different levels of context, which additionally stretch across surrounds of very different sizes;
  • – there are rich organizational constraints, e.g. grammars;
  • – there is a large corpus of words, while any of them bears meaning only upon interpretation;
  • – there is a large number of relations that not only form a network, but which also change dynamically in the course of reading and interpretation;
  • – texts are symbolic: spatial neighborhood does not translate into reference, in either direction;
  • – understanding texts requires a wealth of external and quite abstract concepts that appear as significant only upon interpretation, as well as a social embedding of mutual interpretation.

This list should at least exclude any attempt to defend the empirico-statistical approach as a reasonable one, except for the fact that it conveys a better-than-nothing attitude. This brings us to the question of utility.

Engineers build machines that are supposedly useful; more exactly, they are intended to fulfill a particular purpose. Mostly, however, machines, and any technology in general, are useful only upon processes of subjective appropriation. The most striking example for this is the car. Likewise, computers evolved not for reasons of utility, but rather for gaming. Video did not become popular for artistic or commercial reasons, but due to the possibilities the medium offered to the sex industry. The lesson here is that an intended purpose is difficult to enforce through the actual usage of a technology. On the other hand, every technology may exert a certain gravitational force towards developing an unintended symbolic purpose, and with regard to that even considerable value. So, could we agree that the classification of texts as it is performed by contemporary technology is useful?

Not quite. We can't regard the classification of texts as it is possible with the empirico-statistical approach as a reasonable technology, for the classification of texts can't be separated from their understanding. All we can accomplish with this approach is to filter out those texts that do not match our interests with a sufficiently high probability. Yet for this task we do not need text classification.

Architectures like the 3L-SOM (see below) could also be expected to play an important role in translation, since translation requires an even deeper understanding of texts than is needed for sorting texts according to a template.

Besides the necessity for this doubly recurrent scheme, we haven't said much so far about how to actually treat the text. Texts should not be mistaken for empirical data. This means that we have to take a modified stance regarding measurement itself. In several essays we have already mentioned the conceptual advantages of the two-layered (TL) approach based on self-organizing maps (TL-SOM). We have already described in detail how the TL-SOM works, including the basic preparation of the random graph as it has been described by Kohonen.

The important thing about the TL-SOM is that it is not a device for modeling the similarity of texts. It is just a representation, even if a very powerful one, because it is based on probabilistic contexts (random graphs). More precisely, it is just one of many possible representations, even if it is much more appropriate than n-grams and other jokes. We should NOT even consider the TL-SOM as so-called “unsupervised modeling”, as the distinction between unsupervised and supervised is just another myth (= nonsense when it comes to quantitative models). The TL-SOM is nothing else than an instance of associative storage.

The trick of using a random graph (see the link above) is that the surrounds of words are differentially represented as well. The Kohonen model is quite sparse in this respect, since it applies a completely neutral model: words in a text are represented as if they were all the same, of the same kind, of the same weight, etc. That's clearly not reasonable. Instead, we should represent a word in several different manners within the same SOM.

Yet the random graph approach should not be considered just a “trick”. We have repeatedly argued (for instance here) that we have to “dissolve” empirical observations into a probabilistic (re)presentation in order to evade and avoid the pseudo-problem of “symbol grounding”. Note that even by the practice of setting up a table in order to organize “data” we are already crossing the Rubicon into the realm of the symbolic!

The real trick of the TL-SOM, however, is something completely different. The first layer represents the random graph of all words; the actual pre-specific sorting of texts, however, is performed by the second layer, operating on the output of the first layer. In other words, the text is “renormalized”, and the SOM itself is used as a measurement device. This renormalization allows us to organize data in a standardized manner while avoiding the symbolic fallacy. To our knowledge, this possible usage of the renormalization principle has not been recognized so far. It is indeed a very important principle that puts many things in order. We will deal with this issue again in a separate contribution.

Only on the basis of the associative storage, taken as an entirety, does appropriate modeling of textual data become possible. The tremendous advantage is that the structure for any subsequent consideration now remains constant. We may indeed set up a table. The content of this table, the data, however, is not derived directly from the text. Instead we first apply renormalization (a technique known from quantum physics, cf. [1]).
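As a rough illustration of this renormalization step, here is a minimal sketch (our own simplification in Python, not the actual SomFluid code; the vector format and names are assumptions): a text is described not by its raw symbols but by how its word-context vectors land on the first-layer SOM, yielding a standardized histogram that a table, or the second layer, can work with.

```python
import numpy as np

def bmu(som, vec):
    """Index of the best matching unit for a given profile vector."""
    return int(np.argmin(np.linalg.norm(som - vec, axis=1)))

def renormalize_text(word_context_vectors, first_layer_som):
    """Describe a text by how its words 'land' on the first layer:
    a normalized histogram of BMU hits. This histogram is the
    standardized, renormalized input for any subsequent layer."""
    hits = np.zeros(len(first_layer_som))
    for v in word_context_vectors:   # one probabilistic context per word occurrence
        hits[bmu(first_layer_som, v)] += 1.0
    return hits / max(hits.sum(), 1.0)
```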

The input is some description of the text given completely in terms of the TL-SOM. More explicitly, we have to “observe” the text as it behaves in the TL-SOM. Here we are indeed legitimized to treat the text as an empirical observation, albeit we can, of course, observe the text in many different ways. Yet observing means conceiving of the text as a moving target, as a series of multitudes.

One of the available tools is Markov modeling, either as Markov chains or by means of Hidden Markov Models. But there are many others. Most significantly, probabilistic grammars, even probabilistic phrase structure grammars, can be mapped onto Markov models. Yet again we meet the problem of apriori classification. Both kinds of models, Markovian as well as grammarian, need an assignment of a grammatical type to a phrase, which often first requires understanding.

Given the autonomy of texts, their temporal structure and the impossibility of applying an apriori schematism, our proposal is that we should conceive of the text as we do of (higher) animals. Like an animal in its habitat, we may think of the text as inhabiting the TL-SOM, our associative storage. We can observe paths, their length and form, preferred neighborhoods, velocities, the size and form of the habitat.
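A small sketch of this “ethological” observation, under the same assumptions as the previous snippet (word-context vectors and a flat array of node profiles; all names are ours): the text is reduced to the path it traces across the map, from which habitat size, step lengths and transition counts can be read off.

```python
import numpy as np

def text_path(word_context_vectors, som):
    """The text as a moving target: the sequence of nodes it visits."""
    bmu = lambda v: int(np.argmin(np.linalg.norm(som - v, axis=1)))
    return [bmu(v) for v in word_context_vectors]

def habitat_stats(path, grid_width):
    """A few descriptors of the path: habitat size, mean step length
    ('velocity'), and Markov-style transition counts between nodes."""
    xy = np.array([(p % grid_width, p // grid_width) for p in path], dtype=float)
    steps = np.linalg.norm(np.diff(xy, axis=0), axis=1) if len(xy) > 1 else np.array([])
    transitions = {}
    for a, b in zip(path, path[1:]):
        transitions[(a, b)] = transitions.get((a, b), 0) + 1
    return {"habitat_size": len(set(path)),
            "mean_step": float(steps.mean()) if steps.size else 0.0,
            "transitions": transitions}
```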

Similar texts will behave in a similar manner. Such similarity is far beyond (better: as if from another planet than) the statistical approach. We can also see now that the statistical approach is trapped by the representationalist fallacy. This similarity is of course a relative one. The important point here is that we can describe texts in a standardized manner strictly WITHOUT reducing their content to statistical measures. It is also quite simple to determine the similarity of texts, whether as a whole or regarding any part of them. We need not determine the range of our source at all apriori to the results of modeling. That modeling introduces a third logical layer. We may apply standard modeling, using a flexible tool for transformation and a further instance of a SOM, as we provide it as SomFluid in the downloads. The important thing is that this last step of modeling has to run automatically.

The proposed structure keeps any kind of reference completely intact. It also draws on its collected experience, that is, all texts it has digested before. It is not necessary to determine stopwords and similar gimmicks. Of course we could, but that's part of the conversation. Just provide an example of any size, just as it is available. Everything from two words, to a sentence, to a paragraph, to the content of a directory will work.

Such a 3L-SOM is very close to what we reasonably could call “understanding texts”. But does it really “understand”?

As such, not really. First, images should be stored in the same manner (!!), that is, preprocessed as random graphs over local contexts of various sizes, into the same (networked population of) SOM(s). Second, a language production module would be needed. But once we have those parts working together, there will be full understanding of texts.

(I take any reasonable offer to implement this within the next 12 months, seriously!)

Conclusion

Understanding is the faculty of moving around in a world of symbols. That's not meant as a trivial issue. First, the world consists of facts, where facts comprise a universe of dynamic relations. Symbols are just not like traffic signs or pictograms, as these belong to the simpler kinds of symbols. Symbolizing is a complex, social, mediatized, diachronic process.

Classifying, understood as “performing modeling and applying models”, basically consists of two parts. One of them could be automated completely, while the other one, setting the purpose, could not be treated by a finite or apriori definable set of rules at all. In the case of texts, classifying can't be separated from understanding, because the purpose of the text emerges only upon interpretation, which in turn requires a manifold of modeling raids. Modeling a (quasi-)physical system is completely different from that; it is almost trivial. Yet the structure of a 3L-SOM could well evolve into an arrangement that is capable of understanding in a way similar to us humans. More precisely, and a bit more abstractly, we could also say that a “system” based on a population of 3L-SOMs will one day be able to navigate in the choreostemic space.

References
  • [1] B. Delamotte (2003). A hint of renormalization. Am. J. Phys. 72 (2004), 170-184; available online: arXiv:hep-th/0212049v3.

۞


SOM = Speedily Organizing Map

February 12, 2012

The Self-organizing Map is a powerful and high-potential computational procedure.

Yet, there is no free lunch, especially not for procedures that are able to deliver meaningful results.

The self-organizing map is such a valuable procedure; we have discussed its theoretical potential with regard to a range of different aspects in other chapters. Here we do not want to deal further with such theoretical or even philosophical issues, e.g. those related to the philosophy of mind; instead we focus on the issue of performance, understood simply in terms of speed.

For all those demo SOMs, algorithmic time complexity is not really an issue. The algorithm approximates a stable state rather quickly. Yet small maps, where “small” means something like fewer than 500 nodes or so, are not really interesting. It is much like in brains. Brains are made largely from neurons and some chemicals, and a lot of them can do amazing things. If you take 500 of them you may stuff a worm in an appropriate manner, but not even a termite. The important questions, beyond the nice story about theoretical benefits, thus are:

What happens with the SOM principle if we connect 1’000’000 nodes?

How to organize 100, 1000 or 10’000 of such million-node SOMs?

With these figures we would end up somewhere around 1..10 billion nodes¹, all organized along the same principles. Just to avoid a common misunderstanding here: these masses of neurons are organized in a very similar manner, yet the totality of them builds a complex system as we have described it in our chapter about complexity. There are several, if not many, emergent levels, and a lot of self-referentiality. These 1 billion nodes are not all engaged with segmenting external data! We will see elsewhere, in the chapter about associative storage and memory, how such a deeply integrated modular system could be conceived of. There are some steps to take, though not terribly complicated or difficult ones.

When approaching such scales, the advantage of self-organization turns palpably into a problematic disadvantage. “Self-organizing” means “bottom-up”, and this bottom-up direction in SOMs is represented by the fact that all records representing the observations have to be compared repeatedly to all nodes in the SOM in order to find the so-called “best matching unit” (BMU). The BMU is just that node in the network whose intensional profile is the most similar one among all the profiles². Though the SOM avoids comparing all records to all records, its algorithmic complexity scales as a power function of its own size! Normally the cost of an algorithm depends on the size of the data, not on the size of the structure itself.

In its naive form the SOM shows a complexity of something like O(n·w·m²), where n is the amount of data (number of records times the size of the feature set), w the number of nodes to be visited while searching for the BMU, and m² the number of nodes affected by the update procedure. In practice w and m are scaled down by factors f1, f2 < 1, but the basic complexity remains. The update procedure affects an area that depends on the size of the SOM, hence the exponent. The exact degree of algorithmic complexity is not absolutely determined, since it depends, among other things, on the dynamics of the learning function.
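For illustration, a naive training loop might look like the following toy sketch (NumPy; our own simplification, not an optimized implementation). The comments mark where the factors n, w and m² enter the cost.

```python
import numpy as np

def train_naive_som(data, grid_w, grid_h, epochs=10, lr=0.5, radius=3.0):
    """Naive SOM training, roughly O(n * w * m^2) per epoch."""
    n_nodes, dim = grid_w * grid_h, data.shape[1]
    som = np.random.rand(n_nodes, dim)
    coords = np.array([(i % grid_w, i // grid_w) for i in range(n_nodes)], dtype=float)
    for _ in range(epochs):
        for rec in data:                                   # factor n: every record
            # factor w: scan all nodes to find the best matching unit
            winner = int(np.argmin(np.linalg.norm(som - rec, axis=1)))
            # factor ~m^2: the update touches a neighborhood whose area
            # grows with the size of the map itself
            d = np.linalg.norm(coords - coords[winner], axis=1)
            h = np.exp(-(d ** 2) / (2 * radius ** 2))
            som += lr * h[:, None] * (rec - som)
    return som
```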

The situation worsens significantly if we apply improvements to the original flavor of the SOM, e.g.

  • – the principle of homogenized variance (per variable across extensional containers);
  • – in targeted modeling, tracking the explicit classification performance per node on the level of records, which means that the data have to be referenced;
  • – size balancing of nodes;
  • – morphological differentiation like growing and melting, as in the FluidSOM, which additionally allows for free-ranging nodes;
  • – evolutionary feature selection and the creation of proto-hypotheses;
  • – dynamic transformation of data;
  • – and, finally, the problem of the indeterminacy of empirical data, which enforces differential modeling, i.e. a style of modeling that experimentally investigates how the results depend on the settings (the free parameters) of the computational procedure: sampling the data, choosing the target, selecting a resolution for the scale of the resulting classification, choosing a risk attitude, and several more.

All of this affects the results of modeling, that is, the prognostic/diagnostic/semantic conclusions one could draw from it. Albeit all these steps could be organized based on a set of rules, including applying another instance of a SOM, and thus could be run automatically, all of these necessary explorations require separate modeling. It is quite easy to set up an exploration plan for differential modeling that requires several dozen models, and if evolutionary optimization is applied, hundreds if not thousands of different maps have to be calculated.

Fortunately, the SOM offers a range of opportunities for using dynamic look-up tables and parallel processing. A SOM consisting of 1’000’000 neurons could easily utilize several thousand threads, without many worries about concurrency (or collisions of parallel threads). Unfortunately, such computers are not available yet, but you get the message…

Meanwhile we have to apply optimization through dynamically built look-up tables.  These I will describe briefly in the following sections.

Searching the Most Similar Node

An integral part of speeding up the SOM in real-life applications is an appropriate initialization of the intensional profiles across the map. Of course, precisely this cannot be known in advance, at least not exactly: the self-organization of the map is the shortest path to its final state, and there is no analytic short-cut. Kohonen proposes to apply Principal Component Analysis (PCA) for calculating the initial values. I am convinced that this is not a good idea. The PCA is deeply incompatible with the SOM, hence it will respond to very different traits in the data. PCA and SOM behave similarly only in demo cases…

Preselecting the Landing Zone

A better alternative is the SOM itself. Since the mapping organized by the SOM preserves the topology of the data, we can apply a much smaller SOM, or even a nested series of down-scaled SOMs, to create a coarse model for selecting the appropriate sub-population in the large SOM. The steps are the following:

  • 1. create the main SOM, say 40’000 nodes, organized as a square grid with sides of 200 nodes;
  • 2. create a SOM for preselecting the landing zone, scaled to approximately 14 by 14 nodes, using the same structure (i.e. the same feature vectors) as for the large SOM;
  • 3. prime the small SOM with a small but significant sample of the data, say 2’000..4’000 records for this small map of around 200 nodes, drawn randomly from the data; this step completes comparatively quickly (by a factor of 200 in our example);
  • 4. initialize the main SOM by a blurred (geometric) projection of the intensional profiles from the minor onto the larger SOM;
  • 5. now use the minor SOM as a model for selecting the landing zone, simply by means of geometric projection.

As a result, the number of nodes to be visited in the large SOM in order to find the best match remains almost constant.
There is an interesting correlate to this technique. If one needs a series of SOM-based representations of the data, distinguished just by the maximum number of nodes in the map, one should always start with the lowest, i.e. coarsest, resolution, with the least number of nodes. The results can then be used as a projective priming of the SOM on the next level of resolution.
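A minimal sketch of the landing-zone search described above (assuming square maps stored as flat arrays of node profiles; the names and the zone radius are ours):

```python
import numpy as np

def landing_zone_bmu(record, small_som, small_w, big_som, big_w, zone_radius=5):
    """Two-stage BMU search: the small SOM picks a coarse position, which is
    projected geometrically onto the large map; only that vicinity is scanned."""
    # 1. coarse position on the small map
    c = int(np.argmin(np.linalg.norm(small_som - record, axis=1)))
    cx, cy = c % small_w, c // small_w
    # 2. project onto the large map by simple scaling of coordinates
    scale = big_w / small_w
    px, py = int(cx * scale), int(cy * scale)
    # 3. scan only the landing zone around the projected position
    big_h = len(big_som) // big_w
    best, best_d = 0, float("inf")
    for y in range(max(0, py - zone_radius), min(big_h, py + zone_radius + 1)):
        for x in range(max(0, px - zone_radius), min(big_w, px + zone_radius + 1)):
            i = y * big_w + x
            d = float(np.linalg.norm(big_som[i] - record))
            if d < best_d:
                best, best_d = i, d
    return best
```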

Explicit Lookup Tables linking Observations to SOM areas

In the previous section we described the usage of a secondary, much smaller SOM as a device for limiting the number of nodes to be scanned. The same problem can be addressed by explicit lookup tables that establish a link between a given record and a vicinity around its last (few) best matching units.

If the SOM is approximately stable, that is, after the SOM has seen a significant portion of the data, it is no longer necessary to check the whole map. Just scan the vicinity around the last best matching node in the map. Again, the number of nodes that need to be checked is almost constant.
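The same idea expressed as an explicit lookup table, sketched under the assumption that the map is stored as an (H, W, dim) array (a hypothetical helper, not taken from any particular library):

```python
import numpy as np

last_bmu = {}   # record id -> grid position of the last best match

def cached_bmu(rec_id, record, som_grid, radius=4):
    """After the map has become roughly stable, scan only the vicinity
    of the record's previous BMU; fall back to a full scan otherwise."""
    H, W, _ = som_grid.shape
    if rec_id in last_bmu:
        y0, x0 = last_bmu[rec_id]
        ys = slice(max(0, y0 - radius), y0 + radius + 1)
        xs = slice(max(0, x0 - radius), x0 + radius + 1)
        d = np.linalg.norm(som_grid[ys, xs] - record, axis=2)
        wy, wx = np.unravel_index(np.argmin(d), d.shape)
        y, x = ys.start + wy, xs.start + wx
    else:                                    # first encounter: full scan
        d = np.linalg.norm(som_grid - record, axis=2)
        y, x = np.unravel_index(np.argmin(d), d.shape)
    last_bmu[rec_id] = (int(y), int(x))
    return last_bmu[rec_id]
```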

The stability cannot be predicted in advance, of course. The SOM is, as the name says, self-organizing (albeit in a weak manner). As a rule of thumb one can check the average number of observations attached to a particular node, the average being taken across all nodes that contain at least one record. This average filling should be larger than 8..10 (due to considerations based on variance, and some arguments derived from non-parametric statistics… but it is a rule of thumb).

Large Feature Vectors

Feature vectors can be very large. In the life sciences and medicine I have experienced cases with 3000 raw variables. During data transformation this number can increase to 4000..6000 variables. The comparison of such feature vectors is quite expensive.

Fortunately, there are some nice tricks, which are all based on the same strategy. This strategy comprises the following steps.

  • 1. create a temporary SOM with a very different feature vector: it has just around 80..100 (real-valued) positions plus 1 position for the index variable (in other words, the table key); thus the size of the vector is about a 60th of the original vector if we are faced with 6000 variables;
  • 2. create the derived vectors by encoding the records representing the observations with a technique called “random projection”; such a projection is generated by multiplying the data vector with a token from a catalog of (labeled) matrices that are filled with uniform random numbers ([0..1]);
  • 3. create the “random projection” SOM based on these transformed records;
  • 4. after training, replace the random projection data with the real data, re-calculate the intensional profiles accordingly, and run a small sample of the data through the SOM for final tuning.

The technique of random projection was invented in 1988. The principle works because of two facts:

  • (1) Quite amazingly, all random vectors beyond a certain dimensionality (80..200, as said before) are nearly orthogonal to each other. The random projection thus compresses the original data without losing the bits of information that are distinctive, even if they are no longer accessible in an explicit manner.
  • (2) The only trait of the data that is considered by the SOM is their potential difference.
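A sketch of the projection step as described above, using uniform random matrices as stated (the fixed seed and the function names are our assumptions):

```python
import numpy as np

def random_projection(data, k=100, seed=42):
    """Project high-dimensional records (e.g. 6000 variables) down to k
    dimensions by multiplying with one fixed random matrix. Beyond roughly
    80..200 dimensions the random directions are nearly orthogonal, so the
    pairwise differences between records are largely preserved."""
    rng = np.random.default_rng(seed)
    R = rng.random((data.shape[1], k))   # uniform [0..1), kept fixed ("labeled")
    return data @ R

# The temporary SOM is then trained on these k-dimensional vectors (plus the
# record key), making each profile comparison ~60x cheaper for 6000 variables.
```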

Bags of SOMs

Finally, one can take advantage of splitting the data into several smaller samples. These samples require only smaller SOMs, which run much faster (we are faced with a power law). After training each of the SOMs, they can be combined into a compound model.

This technique is known as bagging in data mining. Today it is also quite popular in the form of so-called random forests, where instead of one large decision tree many smaller ones are built and then combined. This technique is very promising, since it is a technique of nature: it is simply modularization on an abstract level, leading to the next level of abstraction in a seamless manner. It is also one of our favorite principles for the overall “epistemic machine”.
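In SOM terms, bagging might be sketched as follows; train_som stands for any SOM training routine (e.g. the naive one sketched earlier), and averaging the best-match distances is just one possible way of combining the maps:

```python
import numpy as np

def train_bagged_soms(data, train_som, n_bags=8, **som_params):
    """Split the data into several smaller samples and train one small SOM
    per sample; the resulting maps together form the compound model."""
    rng = np.random.default_rng(0)
    bags = np.array_split(rng.permutation(len(data)), n_bags)
    return [train_som(data[b], **som_params) for b in bags]

def bagged_bmu_distance(record, soms):
    """Query the compound model, here by averaging the best-match
    distance over all member maps."""
    return float(np.mean([np.min(np.linalg.norm(s - record, axis=1)) for s in soms]))
```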

Notes

1. This would represent just around 10% of the neurons of our brain, if we interpret each node as a neuron. Yet this comparison is misleading. The functionality of a node in a SOM rather represents a whole population of neurons, although there is no 1:1 principle transferable between them. Hence such a system would be roughly of the size of a human brain and, much more importantly, it would likely be organized in a comparable, albeit rather alien, manner.

2. Quite often, the vectors attached to the nodes are called weight vectors. This is a serious misnomer, as neither the nodes nor the variables that make up the vector are weighted by it (alone) (for more details see here). Conceptually it is much more sound to call those vectors “intensional profiles.” Actually, one could indeed imagine a weight vector that would control the contribution (“weight”) of variables to the similarity / distance between two such vectors. Such weight vectors could even be specific to each of the nodes.


۞

The Self-Organizing Map – an Intro

October 20, 2011

A Map that organizes itself:

Is it the paradise for navigators, or is it the hell?

Well, it depends, I would say. As a control freak, or a warfarer like Shannon in the early 1940s, you would probably vote for hell. And indeed, there are presumably only very few entities that have been as strongly neglected by information scientists as the self-organizing map. Of course, there are some reasons for that. The other type of navigator, the one probably enjoying the SOM, is more likely of the type of Odysseus, or Baudolino, the hero of Umberto Eco's novel of the same name.

More seriously, the self-organizing map (SOM) is a powerful and, even today (2011), still underestimated structure, though it is meanwhile rapidly gaining in popularity. This chapter serves as a basic introduction to the SOM, along with a first discussion of the strengths and weaknesses of its original version. Today there are many versions around, mainly in research; the most important ones I will briefly mention at the end. It should be clear that there are tons of articles around on the web. Yet most of them focus on the mathematics of the more original versions, but do not describe or discuss the architecture itself, or provide a suitable high-level interpretation of what is going on in a SOM. So I will not repeat the mathematics; instead I will try to explain it also for non-engineers, without using mathematical formulas. Actually, the mathematics is not the most serious thing about it anyway.

Brief

The SOM is a bundle comprising a mathematical structure and a particularly designed procedure for feeding multi-dimensional (multi-attribute) data into it, data that are prepared as a table. The number of attributes can reach tens of thousands. Its purpose is to infer the best possible sorting of the data in a 2(3)-dimensional grid. Yet both preconditions, the dimensionality and the data as a table, are not absolute and may be overcome by certain extensions to the original version of the SOM. The sorting process groups more similar records closer together. Thus we can say that a body of high-dimensional data (organized as records from a table) is mapped onto 2 dimensions, thereby enforcing a weighting of the properties used to describe (essentially: to create) the records.

The SOM can be parametrized such that it is a very robust method for clustering data. The SOM exhibits an interesting duality, as it can be used for basic clustering as well as for target-oriented predictive modeling. This duality opens interesting possibilities for realizing a pre-specific associative storage. The SOM is particularly interesting due to its structure and hence its extensibility, properties that most other methods do not share with the SOM. Though substantially different from other popular structures like Artificial Neural Networks, the SOM may be included in the family of connectionist models.

History

The development leading finally to the SOM started around 1973 in a computer science lab at Helsinki University. It was Teuvo Kohonen who became aware of certain memory effects of correlation matrices. Until 1979, when he first published the principle of the Self-Organizing Map, he deliberately adopted principles known from the human brain. A few further papers followed, and a book about the subject in 1983. Then the SOM wasn't readily adopted for at least 15 years. Its direct competitor for acceptance, the Backpropagation Artificial Neural Network (B/ANN), was published in 1985, after neural networks had been rediscovered in physics, following investigations of spin glasses and certain memory effects there. Actually, the interest in simulating neural networks dates back to 1941, when von Neumann, Hebb, McCulloch, and also Pitts, among others, met at a conference on the topic.

For a long time the SOM wasn't regarded as a “neural network,” and this was considered a serious disadvantage. The first part of the diagnosis indeed was true: Kohonen never tried to simulate individual neurons, as was the goal of all simulations of ANN. The ANN research has been deeply informed by physics, cybernetics and mathematical information theory. Simulating neurons is simply not adequate; it is a kind of silly science. Above all, most ANN are just a very particular type of “network,” as there are no connections within a particular layer. In contrast, Kohonen tried to grasp a more abstract level: the population of neurons. In our opinion this choice is much more feasible and much more powerful as well. In particular, the SOM can represent not only “neurons,” but any population of entities which can exchange and store information. More about that in a moment.

Nowadays the methodology of the SOM can be considered well adopted. More than 8’000 research papers have been published so far, with increasing momentum, covering a lot of practical domains and research areas. Many have demonstrated the superiority or greater generality of the SOM as compared to other methods.

Mechanism

The mechanism of a basic SOM is quite easy to describe, since there are only a few ingredients.
First, we need data. Imagine a table where observations are listed in rows, and the column headers describe the variables that have been measured for each observed case. The variables are also called attributes, or features. Note that in the case of the basic (say, the Kohonen) SOM the structure given by the attributes is the same for all records. Technically, the data have to be normalized per column such that the minimum value is 0 and the maximum value is 1. Note that this normalization ensures the comparability of different sub-sets of observations. It represents just the most basic transformation of data, while many other transformations are possible: logarithmic re-scaling of the values of a column in order to shift the mode of the empirical distribution, splitting a variable by value, binarization, scaling of parameters that are available only on a nominal level, or combining two or several columns by a formula are further examples (for details please visit the chapter about modeling). In fact, the transformation of data (I am not talking here about the preparation of data!) is one of the most important ingredients for successful predictive modeling.
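A minimal sketch of that per-column normalization (NumPy; constant columns are simply left at zero):

```python
import numpy as np

def normalize_columns(table):
    """Per-column min-max normalization to [0, 1], the most basic
    transformation mentioned above."""
    t = np.asarray(table, dtype=float)
    lo, hi = t.min(axis=0), t.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (t - lo) / span
```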

Second, we create the SOM. Basically, and in its simplest form, the SOM is a grid where each cell has 4 edges (squares, rectangles) or 6 edges (hexagonal layout). The grid consists of nodes and edges. Nodes serve as a kind of container, while edges work as a kind of fiber for spreading signals. In some versions of the SOM the nodes can range freely or move around randomly a little bit.

An important element of the architecture of a SOM is that each node gets assigned the same structure that we know from the table. As a consequence, the vectors collected in the nodes can easily be compared by some function (just wait a second for that). In the beginning, each node gets randomly initialized. Then the data are fed into the SOM.

This data feeding is organized as follows. A randomly chosen record is taken from the table and compared to all of the nodes. There is always a best matching node. The record then gets inserted into this node. Upon this insertion, which is a kind of hiding, the values in the node's structure vector are recalculated, e.g. as the (new) mean of all values across all records collected in the node (container). The trick now is to change not just the winning node into which the data record has been inserted, but also all nodes of the close surround, though with a strength that decreases with distance.

This small activity of searching the best matching node, insertion, and spreading of information is done for all records, and possibly also repeated. The spreading of information to the neighboring nodes is a crucial element of the SOM mechanism. This spreading is responsible for the self-organization. It also represents a probabilistic coupling in a network. Of course, there are some important variants of it, see below, but basically that's all. Ok, there is some numerical bookkeeping, optimizations for searching the winning nodes etc., but these measures are not essential for the mechanism.

As a result one will find similar records in the same node, or in direct neighbors. It has been shown that the SOM is topology-preserving, that is, the SOM is smooth with regard to the similarity of neighboring nodes. The data records inside a node form a list, which is described by the node's value vector. That value vector could be said to represent a class, or intension, which is defined by its empirical observations, the cases, or extension.

After feeding all data to the SOM the training is finished. For the SOM it is easy to run in a continuous mode, where the feed of incoming data never “stops”. Now the SOM can be used to classify new records. A new record simply needs to be compared to the nodes of the SOM, i.e. to the value vectors of the nodes, but NOT to all the cases (SOM is not case-based reasoning, but type-based reasoning!). If the records contain a marker attribute, e.g. indicating the quality of the record, you will also get the expected quality for a new record of unknown quality.
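As a small sketch of this type-based classification (the node profiles as a NumPy array, node_labels as e.g. the per-node mean of the marker attribute; the names are ours):

```python
import numpy as np

def classify(record, som, node_labels):
    """Type-based reasoning: compare the new record only to the node
    profiles, never to the stored cases, and read off the winning
    node's label, e.g. its expected quality."""
    winner = int(np.argmin(np.linalg.norm(som - record, axis=1)))
    return node_labels[winner]
```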

Properties of the SOM

The SOM belongs to the class of clustering algorithms. It is very robust against missing values in the table, and unlike many other methods it does NOT require any settings regarding the clusters, such as their size or number. Of course, this is a great advantage and a property of logical consistency. Nodes may remain empty, while the value vector of an empty node is still well-defined. This is a very important feature of the SOM, as it represents the capability to infer potential yet unseen observations. No other method is able to behave like this. Other properties can be invoked by means of possible extensions of the basic mechanism (see below).

As already said, nodes collect similar records of data, where a record represents a single observation. It is important to understand that a node is not equal to a cluster. In our opinion it does not make much sense to draw boundaries around one or several nodes and so propose a particular “cluster.” Such a boundary should be set only in light of an external purpose. Inversely, without such a purpose it is sense-free to conceive of a trained SOM as a model. At best it would represent a pre-specific model, and it is a great property of the SOM to be able to create such a thing.

The learning is competitive, since different nodes compete for a single record. Yet it is also cooperative, since upon an insert operation information is exchanged between neighboring nodes.

The reasoning of the SOM is type-based, which is much more favorable than case-based reasoning. It is also more flexible than ANN, which just provide a classification without any distinction between extension and intension. The SOM, but not the ANN, can be used in two very different modes: either just for clustering or grouping individual observations without any further assumptions, or, secondly, for targeted modeling, that is, for establishing a predictive/diagnostic link between several (or many) basic input variables and one (or several) target variable(s) that represent the outcome of a process according to experience. Such a double usage of the same structure is barely accessible to any other associative structure.

Another difference is that ANN are much more suitable for approximating single analytic functions, while SOM are suitable for general classification tasks, where the parameter space and/or the value (outcome) space could even be discontinuous or folded.

A large advantage over many other methods is that the similarity function and the cost function are explicitly accessible. For ANN, SVM or statistical learning this is not possible. Similarly, the SOM automatically adapts its structure according to the data, i.e. it is also possible to change the method within the learning process, adaptively and in a self-organized manner.

As a result we can conclude that the design of the SOM method is much more transparent than that of any of the other advanced methods.

Competing Comparable Methods

SOM are either more robust, more general or simpler than any other method, while the quality of classification is closely comparable. Among the competing methods are artificial neural networks (ANN), principal component analysis (PCA), multi-dimensional scaling (MDS), and adaptive resonance theory networks (ART). Important ideas of ART networks can be merged with the SOM principle, keeping the benefits of both. PCA and MDS are based on statistical correlation analysis (covariance matrices), i.e. they import all the assumptions and necessary preconditions of statistics, namely the independence of the observational variables. Yet the goal is to identify precisely such dependencies, so it is not quite feasible to presuppose their absence! SOM do not suffer from such limitations born of strange assumptions; moreover, it has recently been proven that SOM are just a generalization of PCA.

Of course, there are many other methods, like Support Vector Machines (SVM) with statistical kernels, or tree forests; yet these methods are purely statistical in nature, with no structural part in them. Moreover, they do not provide access to the similarity function as the SOM does.

A last word about the particular difference between ANN and SOM. SOM are truly symmetrical networks, where each unit has its own explicit memory about observations, while the linkage to other units on the same level of integration is probabilistic. That means that the actual linkage between any two units can be changed dynamically within the learning process. In fact, a SOM is thus not a single network like a fishing net; it is much more appropriate to conceive of it as a representation of a manifold of networks.

Contrary to those advanced structural properties, the so-called Artificial Neural Networks are explicitly directional networks. Units represent individual neurons and do not have storage capacities. A unit does not know anything about things like observations. Conceptually, these units are thus on a much lower level than the units in a SOM. In ANN they cannot have an “inner” structure. The same is true for the links between the units. Since they have to be programmed in an explicit manner (which is called the “architecture”), the topology of the connections cannot be changed during learning at the runtime of the program.

In ANN, information flows through the units in a directed manner (as in the case of natural neurons). It is almost impossible there to create a connectivity within a single layer of units as dense as in the SOM. As a consequence, ANN do not show the capability for self-organization.

Taken as a whole, ANN seem to be under the representationalist delusion. In order to achieve the same general effects and abstract phenomena as the SOM is able to, very large ANN would be necessary. Hence, pure ANN are not really a valid alternative for our endeavor. This does not rule out the possibility of using them as components within a SOM or between SOMs.

Variants and Architectures

Here are some extensions and improvements of the SOM.

Homogenized Extensional Diversity
The original version of the SOM tends to collect “bad” records, i.e. those not matching well anywhere else, into a single cluster, even if those records are not similar at all. In this case it is no longer allowed to compare nodes, since the internal variance is no longer comparable and the mean/variance on the level of the node no longer describes the internal variance on the level of the collected records. The cure for this misbehavior is rather simple: the cost function controlling the matching of a record to the most similar node needs to include the variability within the set of records (the extension of the type represented by the node) already collected by the node. Additionally, merging and splitting of nodes as described for structural learning helps effectively. In the scientific literature there is as yet no reference for this extension of the SOM.
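A possible sketch of such a variance-aware matching cost (the weighting factor alpha and the exact form are our assumptions, not an established formulation):

```python
import numpy as np

def matching_cost(record, node_profile, node_records, alpha=1.0):
    """Distance to the node profile plus a penalty for the internal
    variability of the records already collected by the node, so that a
    single heterogeneous 'garbage' node cannot keep absorbing everything."""
    d = float(np.linalg.norm(record - node_profile))
    spread = float(np.mean(np.var(np.asarray(node_records), axis=0))) if len(node_records) > 1 else 0.0
    return d + alpha * spread
```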

Structural Learning
One of the most basic extensions to the original mechanism is to allow for the splitting and merging of nodes according to some internal divergence criterion. A SOM made from such nodes is able to adapt structurally to the input data, not just statistically. This feature is inspired by the so-called ART networks [1]. Similarly, merging and splitting of “nodes” of a SOM was proposed by [2], though not in the vein of ART networks.

Nested SOM
Since the SOM represents populations of neurons, it is easy and straightforward to think about nesting SOMs: each node would contain a smaller SOM. A node may even contain any other parametrized method, such as an Artificial Neural Network. The node value vector then would not exhibit the structure of the table, but would instead contain the parameters of the enclosed algorithm. One example of this is the so-called mn-SOM [3].

Growing SOM
Usually, data are not evenly distributed. Hence, some nodes grow much more than others. One way to cope with this situation is to let the SOM grow automatically. Many different variants of growth could be thought of, and some have already been implemented. Our experiments point in the direction of a “stem cell” analog.

Growing SOMs were first proposed by [4], while [5] provides an exploratory implementation. Concerning growing SOMs, it is very important to understand the concept (or phenomenon) of growth. We will discuss possible growth patterns and the consequences for possible new growing SOMs elsewhere. For now we can just say that any kind of SOM structure can grow and/or differentiate.

SOM gas, mobile nodes
The name says it already: the topology of the grid making up the SOM is not fixed. Nodes may even range around quite freely, as in the case of the SOM gas.

Chemical SOM
Starting from mobile nodes, we can think about a small set of properties of nodes which are not directly given by the data structure. These properties can be interpreted as chemicals creating something like a Gray-Scott reaction-diffusion model, i.e. a self-organizing fluid dynamics. The possible effects are (i) a differential opacity of the SOM for transmitted information, (ii) a differentiation into fibers and networks, or (iii) the optimization of the topological structure as a standard part of the life cycle of the SOM. The mobility can be controlled internally by means of a “temperature,” or, expressed as a metaphor, the fixed SOM would partially melt. This may help to reorganize a SOM. In the scientific literature there is as yet no reference for this extension of the SOM.

Evolutionary Embedded SOM with Meta-Modeling
SOMs can be embedded into an evolutionary learning process about the most appropriate selection of attributes. This can be extended even towards the construction of structural hypotheses about the data. While other methods could also be embedded in a similar manner, the results are drastically different, since most methods do not learn structurally. Coupling evolutionary processes with associative structures was proposed a long time ago by [6], albeit only in the context of the optimization of ANN. While this is quite reasonable, we additionally propose to use evolution in a different manner and for different purposes (see the chapter about evolution).

[1] ART networks
[2] merging splitting of nodes
[3] The mn-SOM
[4] Growing SOM a
[5] Growing SOM b
[6] evolutionary optimization of artificial neural networks

۞
