The Text Machine
July 10, 2012
What is the role of texts? How do we use them (as humans)?
How do we access them (as reading humans)? The answers to such questions seem pretty obvious. Almost everybody can read. Well, today. Notably, reading itself, as a performance and with regard to its use, changed dramatically at least twice in history: first after the invention of the vocalic alphabet in ancient Greece, and a second time after book printing became abundant during the 16th century. Maybe the issue of reading isn’t as simple as it seems in everyday life.
Beyond such accounts of historical issues and basic experiences, we have a lot of more theoretical results concerning texts. They begin with Friedrich Schleiermacher, who was the first to identify hermeneutics as a subject around 1830 and formulated it in a way that has been considered more complete and powerful than the version proposed by Gadamer in the 1950s. They proceed, of course, with Wittgenstein (language games, rule following), Austin (speech act theory) and Quine (criticizing empiricism). Philosophers like John Searle, Hilary Putnam and Robert Brandom then explicated and extended the work of the former heroes. And those have been accompanied by many others. If you wonder why linguistics is missing here: it is because linguistics does not provide theories about language. Today, the domain is largely captured by positivism and the corresponding analytic approach.
Here in this little piece we pose these questions in the context of certain relations between machines and texts. There are a lot of such relations, some of them quite sophisticated or surprising. For instance, texts can be considered as a kind of machine. Yet, they bear a certain note of (virtual) agency as well, resulting in a considerable non-triviality of this machine aspect of texts. Here we will not deal with this perspective. Instead, we will just take a look at the possibilities and the respective practices of handling or “treating” texts with machines. Or, if you prefer, the treating of texts by machines, as far as a certain autonomy of machines could be considered necessary to deal with texts at all.
Today, we can find a fast-growing community of computer programmers who deal with texts as a kind of unstructured information. One of the buzzwords is the so-called “semantic web”, another one is “sentiment analysis”. We won’t comment in any detail on those movements, because they are deeply flawed. The first one tries to formalize semantics and meaning apriori, trying to render the world into a trivial machine. We have repeatedly criticized this, and we agree herein with Douglas Hofstadter (see this discussion of his “Fluid Analogy”). The second tries to identify the sentiment of a text or a “tweet”, e.g. about a stock or an organization, on the basis of statistical measures about keywords and their utterly naive “n-grammed” versions, without actually paying any attention to the problem of “understanding”. Such nonsense would not be as widespread if programmers read just a few fundamental philosophical texts about language. In fact, they don’t, and thus they are condemned to revisit any of the underdeveloped positions that arose centuries ago.
If we neglect the social role of texts for a moment, we might identify a single major role of texts, albeit we have to describe it then in rather general terms. We may say that the role of a text, as a specimen of many other texts from a large population, is its functioning as a medium for the externalization of mental content in order to serve the ultimate purpose, which consists of the possibility for a (re)construction of resembling mental content on the side of the interpreting person.
Interpretation has primacy. It is not possible to assign meaning to a text like a sticky note and then put the text, including the yellow sticky note, directly into the recipient’s brain. That may sound silly, but unfortunately it’s the “theory” followed by many people working in the computer sciences. Interpretation can’t be controlled completely, though, not even by the mind performing it, not even by the same mind that seconds before externalized the text through writing or speaking.
Now, the notion of mental content may seem both quite vague and hopelessly general. Yet, in the previous chapter we introduced a structure, the choreostemic space, which allows us to speak quite precisely about mental content. Note that we don’t need to talk about semantics, meaning or references to “objects” here. Mental content is not a “state” either. Thinking “state” and the mental together is much on the same level as seriously considering the existence of sea monsters at the end of the 18th century, when the list science of Linnaeus had not yet been reshaped by the upcoming historical turn in the philosophy of nature. Nowadays we must consider it silly to think about a complex story like the brain and its mind by means of “state”. Doing so, one confounds the stability of the graphical representation of a word in a language with the complexity of a multi-layered dynamic process, spanned between deliberate randomness, self-organized rhythmicity and temporary, thus preliminary, meta-stability.
The notion of mental content does not refer to the representation of referenced “objects”. We do not have maps, lists or libraries in our heads. Everything which we experience as inner life builds up from an enormous randomness through deep stacks of complex emergent processes, where each emergent level is also shaped top-down, implicitly and, except for the last one (usually called “consciousness”), also explicitly. The stability of memory and words, of feelings and faculties is deceptive; they are not so stable at all. Only their externalized symbolic representations are more or less stable, and even their stability as words etc. can be shattered easily. The point we would like to emphasize here is that everything that happens in the mind is constructed on the fly, while the construction is completed only with the ultimate step of externalization, that is, speaking or writing. The notion of “mental content” is thus a bit misleading.
The mental may be conceived most appropriately as a manifold of stacked and intertwined processes. This holds for the naturalist perspective as well as for the abstract perspective, as we have argued in the previous chapter. It is simply impossible to find a single stable point within the (abstract) dynamics between model, concept, mediality and virtuality, which could be thought of as spanning a space. We called it the choreostemic space.
For the following remarks about the relation between text and machines and the practitioners engaged in building machines to handle texts we have to keep in mind just those two things: (i) there is a primacy of interpretation, (ii) the mental is a non-representative dynamic process that can’t be formalized (in the sense of “being represented” by a formula).
In turn this means that we should avoid referring to formulas when setting out to build a “text machine”. Text machines will be helpful only if their understanding of texts, even if it is a rudimentary understanding, follows the same abstract principles as our human understanding of texts does. Machines pretending to deal with texts, but actually only moving dead formal symbols back and forth, as is the case in statistical text mining, n-gram based methods and the like, are not helpful at all. The only thing that happens is that these machines introduce a formalistic structure into our human life. We may say that these techniques render humans helpful to machines.
Nowadays we can find a whole techno-scientific community that is engaged in the field of machine learning applied to “textual data”. The computers are programmed in such a way that they can be used to classify texts. The idea is to provide some keywords, or anti-words, or even a small set of sample texts, which are then taken by the software as a kind of template that is used to build a selection model. This model is then used to select resembling texts from a large set of texts. We have to be very clear about the purpose of these software programs: they classify texts.
The input data for doing so is taken from the texts themselves. More precisely, the texts are preprocessed according to specialized methods. Each of the texts gets described by a possibly large set of “features” that have been extracted by these methods. The obvious point is that the procedure is purely empirical in the strong sense. Only the available observations (the texts) are taken to infer the “similarity” between texts. Usually, not even linguistic properties are used to form the empirical observations, albeit there are exceptions. People use the so-called n-gram approach, which is little more than counting letters. It is a zero-knowledge model of the series of symbols that humans interpret as text. Additionally, the frequency or relative positions of keywords and anti-words are usually measured and expressed by mostly quite simple statistical methods.
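To make concrete how little such features know about a text, here is a minimal sketch of character n-gram counting combined with a cosine comparison. The function names `char_ngrams` and `cosine_similarity` are chosen here for illustration only and are not taken from any particular text-mining package; practical systems add weighting schemes and larger feature sets, but the principle is the same.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(a[gram] * b[gram] for gram in set(a) & set(b))
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is shorter than n characters
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first = char_ngrams("The chair stands in the room.")
    second = char_ngrams("A chair was standing in a room.")
    print(round(cosine_similarity(first, second), 3))
```

Nothing in this procedure depends on the symbols being words of a language: any string of characters is counted in exactly the same way.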
Well, classifying texts is something quite different from understanding texts. Of course. Yet, said community tries to reproduce the “classification” achieved or produced by humans. Thus, any engineer in the field of machine learning directed at texts implicitly claims a kind of understanding. They even organize competitions.
The problems with the statistical approach are quite obvious. Quine called it a dogma of empiricism and coined the Gavagai anecdote about it, a situation which even provides much more information than the text alone. In order to understand a text we need references to many things outside the particular text(s) at hand. Two of those are especially salient: concepts and the social dimension. Contrary to the belief of positivists, concepts can’t be defined in advance of a particular interpretation. Using catalogs of references does not help much if these catalogs are used just as lists of references. The software does not understand “chair” by the “definition” stored in a database, or even by the set of such references. It simply does not care whether the stored ASCII codes yield the symbol “chair” or the symbol “h&e%43”. Douglas Hofstadter has been stressing this point over and over again, and we fully agree with that.
From that necessity of a particular and rather wide “background” (a notion due to Searle) the second problem derives, which is much more serious, even devastating to the soundness of the whole empirico-statistical approach. The problem is simple: even we humans have to read a text before being able to understand it. Only upon understanding it could we classify it. Of course, the brains of many people are trained sufficiently to work out the relations of the text and any of its components while reading it. The basic setup of the problem, however, remains the same.
Actually, what is happening is a constantly repeated re-reading of the text, taking into account all available insights regarding the text and the relations of it to the author and the reader, while this re-reading often takes place in the memory. To perform this demanding task in parallel, based on the “cache” available from memory, requires a lot of experience and training, though. Less experienced people indeed re-read the text physically.
The consequence of all of that is that we could not determine the best empirical discriminators for a particular text in-the-reading in order to select it as-if we would use a model. Actually, we can’t determine the set of discriminators before we have read it all, at least not before the first pass. Let us call this the completeness issue.
The very first insight is thus that a one-shot approach in text classification is based on a misconception. The software and the human would have to align to each other in some kind of conversation. Otherwise it can’t be specified in principle what the task is, that is, which texts should actually be selected. Any approach to text classification not following the “conversation scheme” is necessarily bare nonsense. Yet, that’s not really a surprise (except for some of the engineers).
There is a further consequence of the completeness issue. We can’t set up a table to learn from at all. This too is not a surprise, since setting up a table means setting up a particular symbolization. Any symbolization apriori to understanding must count as a hypothesis. It is as simple as that. Whether it matches our purpose or not, we can’t know before we have understood the text.
However, in order to make the software learn something we need assignates (traditionally called “properties”) and some criteria to distinguish better models from less performant ones. In other words, we need a recurrent scheme on the technical level as well.
That’s why it is not perfectly correct to call texts “unstructured data”. (Besides the fact that data are not “out there”: we always need a measurement device, which in turn implies some kind of model AND some kind of theory.) In the case of texts, imposing a structure onto a text simply means understanding it. We could even say that a text as text is not structurable at all, since the interpretation of a text can never be regarded as finished.
Altogether, we may summarize the complexity of texts as deriving from the following properties:
- there are different levels of context, which additionally stretch across surrounds of very different sizes;
- there are rich organizational constraints, e.g. grammars;
- there is a large corpus of words, while any of them bears meaning only upon interpretation;
- there is a large number of relations that not only form a network, but also change dynamically in the course of reading and of interpretation;
- texts are symbolic: spatial neighborhood does not translate into reference, in either direction;
- understanding of texts requires a wealth of external and quite abstract concepts that appear as significant only upon interpretation, as well as a social embedding of mutual interpretation.
This list should at least exclude any attempt to defend the empirico-statistical approach as a reasonable one, except for the fact that it conveys a better-than-nothing attitude. This brings us to the question of utility.
Engineers build machines that are supposedly useful; more exactly, they are intended to fulfill a particular purpose. Mostly, however, machines, and even technology in general, are useful only upon processes of subjective appropriation. The most striking example of this is the car. Likewise, computers have evolved not for reasons of utility, but rather for gaming. Video did not become popular for artistic reasons or for commercial ones, but due to the possibilities the medium offered for the sex industry. The lesson here is that an intended purpose is difficult to achieve, given the actual usage of the technology. On the other hand, every technology may exert some gravitational force to develop a then unintended symbolic purpose and, in that regard, even considerable value. So, could we agree that the classification of texts as it is performed by contemporary technology is useful?
Not quite. We can’t regard the classification of texts as it is possible with the empirico-statistical approach as a reasonable technology. For the classification of texts can’t be separated from their understanding. All we can accomplish by this approach is to filter out those texts that do not match our interests with a sufficiently high probability. Yet, for this task we do not need text classification.
Architectures like 3L-SOM could also be expected to play an important role in translation, as translation requires an even deeper understanding of texts than is needed for sorting texts according to a template.
Besides the necessity for this doubly recurrent scheme, we haven’t said much so far about how actually to treat the text. Texts should not be mistaken for empirical data. That means that we have to take a modified stance regarding measurement itself. In several essays we already mentioned the conceptual advantages of the two-layered (TL) approach based on self-organizing maps (TL-SOM). We already described in detail how the TL-SOM works, including the basic preparation of the random graph as it has been described by Kohonen.
The important thing about TL-SOM is that it is not a device for modeling the similarity of texts. It is just a representation, even though it is a very powerful one, because it is based on probabilistic contexts (random graphs). More precisely, it is just one of many possible representations, even though it is much more appropriate than n-gram and other jokes. We should NOT even consider the TL-SOM as so-called “unsupervised modeling”, as the distinction between unsupervised and supervised is just another myth (= nonsense when it comes to quantitative models). The TL-SOM is nothing other than an instance of associative storage.
The trick of using a random graph (see the link above) is that the surrounds of words are differentially represented as well. The Kohonen model is quite limited in this respect, since it applies a completely neutral model. In fact, words in a text are represented as if they were all the same: of the same kind, of the same weight, etc. That’s clearly not reasonable. Instead, we should represent a word in several different manners in the same SOM.
Yet, the random graph approach should not be considered just as a “trick”. We repeatedly argued (for instance here) that we have to “dissolve” empirical observations into a probabilistic (re)presentation in order to evade and to avoid the pseudo-problem of “symbol grounding”. Note that even by the practice of setting up a table in order to organize “data” we are already crossing the Rubicon into the realm of the symbolic!
The real trick of the TL-SOM, however, is something completely different. The first layer represents the random graph of all words; the actual pre-specific sorting of texts, however, is performed by the second layer on the output of the first layer. In other words, the text is “renormalized”: the SOM itself is used as a measurement device. This renormalization allows us to organize data in a standardized manner while avoiding the symbolic fallacy. To our knowledge, this possible usage of the renormalization principle has not been recognized so far. It is indeed a very important principle that puts many things in order. We will deal with this issue again later in a separate contribution.
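A rough sketch of the two-layer idea, under strong simplifying assumptions, might look as follows. It is not the TL-SOM implementation referred to in this essay. Every name (`train_som`, `context_vectors`, `two_layer_som`), the map sizes, the window width and the way a text is "observed" on the first layer (relative hit frequencies per unit) are assumptions made purely for illustration, and the sketch expects non-empty texts.

```python
import math
import random


def bmu(weights, x):
    """Index of the best-matching unit: the weight vector closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def train_som(data, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs             # shrinks from 1 towards 0
        rate = 0.01 + 0.5 * decay                # learning rate
        radius = 0.5 + max(rows, cols) / 2 * decay
        for x in rng.sample(data, len(data)):    # shuffled pass over the data
            win_r, win_c = divmod(bmu(weights, x), cols)
            for idx, w in enumerate(weights):
                r, c = divmod(idx, cols)
                grid_d2 = (r - win_r) ** 2 + (c - win_c) ** 2
                pull = rate * math.exp(-grid_d2 / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += pull * (x[k] - w[k])
    return weights


def context_vectors(tokens, codes, window=2):
    """Each word occurrence: its own random code plus the mean code of its neighbours."""
    dim = len(next(iter(codes.values())))
    vectors = []
    for i, tok in enumerate(tokens):
        near = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        mean = [sum(codes[n][k] for n in near) / max(1, len(near)) for k in range(dim)]
        vectors.append(codes[tok] + mean)
    return vectors


def two_layer_som(texts, dim=6, seed=0):
    """Return (second-layer unit per text, first-layer profile per text)."""
    rng = random.Random(seed)
    token_lists = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in token_lists for tok in toks})
    codes = {tok: [rng.gauss(0, 1) for _ in range(dim)] for tok in vocab}
    # Layer 1: a map of word-in-context observations taken from all texts.
    per_text = [context_vectors(toks, codes) for toks in token_lists]
    layer1 = train_som([v for vecs in per_text for v in vecs], 4, 4, seed=seed)
    # "Observe" each text on layer 1: relative frequency of hits per unit.
    profiles = []
    for vecs in per_text:
        hits = [0.0] * len(layer1)
        for v in vecs:
            hits[bmu(layer1, v)] += 1.0 / len(vecs)
        profiles.append(hits)
    # Layer 2: texts are sorted by their layer-1 profiles, not by raw symbols.
    layer2 = train_som(profiles, 2, 2, seed=seed)
    return [bmu(layer2, p) for p in profiles], profiles
```

In this sketch the table handed to the second layer always has the same columns (one per first-layer unit), whatever the texts contain. That is the sense in which the first map serves as a measurement device here.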
Only on the basis of the associative storage taken as an entirety is appropriate modeling possible for textual data. The tremendous advantage of that is that the structure for any subsequent consideration now remains constant. We may indeed set up a table. The content of this table, the data, however, is not derived directly from the text. Instead we first apply renormalization (a technique known from quantum physics, cf. Delamotte, “A hint of renormalization”).
The input is some description of the text completely in terms of the TL-SOM. More explicitly, we have to “observe” the text as it behaves in the TL-SOM. Here, we are indeed entitled to treat the text as an empirical observation, although we can, of course, observe the text in many different ways. Yet, observing means to conceive of the text as a moving target, as a series of multitudes.
One of the available tools is Markov modeling, either as Markov chains, or by means of Hidden Markov Models. But there are many others. Most significantly, probabilistic grammars, even probabilistic phrase structure grammars can be mapped onto Markov models. Yet, again we meet the problem of apriori classification. Both models, Markovian as well as grammarian, need an assignment of grammatical type to a phrase, which often first requires understanding.
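As a minimal illustration of the Markov-chain option, the sketch below estimates first-order transition probabilities from a sequence of states. What counts as a state is left open on purpose: it could be the index of the map unit that each successive word of a text activates, which is an assumption of this example and not something prescribed above. The function name `transition_probabilities` is likewise hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P(next state | current state) from one observed sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical sequence of map-unit indices visited by a text.
    visited_units = [3, 7, 3, 5, 3, 7]
    for unit, row in transition_probabilities(visited_units).items():
        print(unit, row)
```

Note that no grammatical type has to be assigned to anything here; the chain is estimated purely from the order in which states occur.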
Given the autonomy of texts, their temporal structure and the impossibility of applying an apriori schematism, our proposal is that we just have to conceive of the text as we do of (higher) animals. Like an animal in its habitat, we may think of the text as inhabiting the TL-SOM, our associative storage. We can observe paths, their length and form, preferred neighborhoods, velocities, size and form of habitat.
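A few of these observables can be sketched quite directly. The example below assumes that a text has already been turned into a path, that is, a list of (row, column) positions of the map units it visits in reading order; how that path is obtained is not shown. The function name `path_descriptors` and the particular descriptors chosen are illustrative assumptions, not a fixed set.

```python
import math


def path_descriptors(path):
    """Summarize a path of (row, col) map positions visited by a text."""
    steps = [math.dist(a, b) for a, b in zip(path, path[1:])]
    rows = [r for r, _ in path]
    cols = [c for _, c in path]
    return {
        # total distance travelled on the map grid
        "path_length": sum(steps),
        # average distance per step, a crude "velocity"
        "mean_step": sum(steps) / len(steps) if steps else 0.0,
        # number of distinct units visited, a crude "habitat size"
        "habitat_size": len(set(path)),
        # height and width of the bounding box of the habitat
        "extent": (max(rows) - min(rows) + 1, max(cols) - min(cols) + 1) if path else (0, 0),
    }


if __name__ == "__main__":
    print(path_descriptors([(0, 0), (0, 3), (4, 3), (0, 0)]))
```

Each text then contributes one row of such descriptors to a table whose columns stay the same for all texts.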
Similar texts will behave in a similar manner. Such similarity is far beyond (better: as if from another planet) the statistical approach. We also can see now that the statistical approach is being trapped by the representationalist fallacy. This similarity is of course a relative one. The important point here is that we can describe texts in a standardized manner strictly WITHOUT reducing their content to statistical measures. It is also quite simple to determine the similarity of texts, whether as a whole, or whether regarding any part of it. We need not determine the range of our source at all apriori to the results of modeling. That modeling introduces a third logical layer. We may apply standard modeling, using a flexible tool for transformation and a further instance of a SOM, as we provide it as SomFluid in the downloads. The important thing is that this last step of modeling has to run automatically.
The proposed structure keeps any kind of reference completely intact. It also draws on its collected experience, that is, on all the texts it has digested before. It is not necessary to determine stopwords and similar gimmicks. Of course, we could, but that’s part of the conversation. Just provide an example of any size, just as it is available. Everything from two words, to a sentence, to a paragraph, to the content of a directory will work.
Such a 3L-SOM is very close to what we reasonably could call “understanding texts”. But does it really “understand”?
As such, not really. First, images should be stored in the same manner (!!), that is, preprocessed as random graphs over local contexts of various size, into the same (networked population of) SOM(s). Second, a language production module would be needed. But once we have those parts working together, then there will be full understanding of texts.
(I take any reasonable offer to implement this within the next 12 months, seriously!)
Understanding is a faculty to move around in a world of symbols. That’s not meant as a trivial issue. First, the world consists of facts, where facts comprise a universe of dynamic relations. Symbols are not just like traffic signs or pictograms, as these belong to the simpler kinds of symbols. Symbolizing is a complex, social, mediatized diachronic process.
Classifying, understood as “performing modeling and applying models”, consists basically of two parts. One of them could be automated completely, while the other one could not be treated by a finite or apriori definable set of rules at all: setting the purpose. In the case of texts, classifying can’t be separated from understanding, because the purpose of the text emerges only upon interpretation, which in turn requires a manifold of modeling raids. Modeling a (quasi-)physical system is completely different from that; it is almost trivial. Yet, the structure of a 3L-SOM could well evolve into an arrangement that is capable of understanding in a way similar to how we humans do. More precisely, and a bit more abstractly, we could also say that a “system” based on a population of 3L-SOM will one day be able to navigate in the choreostemic space.
- B. Delamotte (2004). A hint of renormalization. Am. J. Phys. 72, 170-184. Available online: arXiv:hep-th/0212049v3.