The Text Machine

July 10, 2012

What is the role of texts? How do we use them (as humans)?

How do we access them (as reading humans)? The answers to such questions seem to be pretty obvious. Almost everybody can read. Well, today. Notably, though, reading itself, as a performance and regarding its use, changed dramatically at least twice in history: first after the invention of the vowel alphabet in ancient Greece, and a second time after book printing became widespread during the 16th century. Maybe the issue of reading isn’t as simple as it seems in everyday life.

Beyond such historical issues and basic experiences, we have many more theoretical results concerning texts. Beginning with Friedrich Schleiermacher, who around 1830 was the first to identify hermeneutics as a subject and who formulated it in a way that has been considered more complete and powerful than the version proposed by Gadamer in the 1950s. Proceeding of course with Wittgenstein (language games, rule following), Austin (speech act theory) or Quine (criticizing empiricism). Philosophers like John Searle, Hilary Putnam and Robert Brandom then explicated and extended the work of these former heroes, accompanied by many others. If you wonder why linguistics is missing here: because linguistics does not provide theories about language. Today, the domain is largely captured by positivism and the corresponding analytic approach.

Here in this little piece we pose these questions in the context of certain relations between machines and texts. There are a lot of such relations, some quite sophisticated or surprising. For instance, texts can be considered as a kind of machine. Yet, they bear a certain note of (virtual) agency as well, resulting in a considerable non-triviality of this machine aspect of texts. Here we will not deal with this perspective. Instead, we will just take a look at the possibilities and the respective practices to handle or to “treat” texts with machines. Or, if you prefer, the treating of texts by machines, insofar as a certain autonomy of machines could be considered necessary to deal with texts at all.

Today, we can find a fast growing community of computer programmers dealing with texts as a kind of unstructured information. One of the buzzwords is the so-called “semantic web”, another one is “sentiment analysis”. We won’t comment in any detail on those movements, because they are deeply flawed. The first one tries to formalize semantics and meaning a priori, trying to render the world into a trivial machine. We have repeatedly criticized this, and we agree herein with Douglas Hofstadter (see this discussion of his “Fluid Analogy”). The second tries to identify the sentiment of a text or a “tweet”, e.g. about a stock or an organization, on the basis of statistical measures about keywords and their utterly naive “n-grammed” versions, without paying any notice to the problem of “understanding”. Such nonsense would not be as widespread if programmers read only a few fundamental philosophical texts about language. In fact, they don’t, and thus they are condemned to revisit any of the underdeveloped positions that arose centuries ago.

If we neglect the social role of texts for a moment, we might identify a single major role of texts, albeit we then have to describe it in rather general terms. We may say that the role of a text, as a specimen among many other texts from a large population, is its functioning as a medium for the externalization of mental content, serving the ultimate purpose of enabling the (re)construction of resembling mental content on the side of the interpreting person.

Interpretation thus has primacy. It is not possible to assign meaning to a text like a sticky note, then putting the text, including the yellow sticky note, directly into the recipient’s brain. That may sound silly, but unfortunately it’s the “theory” followed by many people working in the computer sciences. Interpretation can’t be controlled completely, though, not even by the mind performing it, not even by the same mind that seconds before externalized the text through writing or speaking.

Now, the notion of mental content may seem both quite vague and hopelessly general. Yet, in the previous chapter we introduced a structure, the choreostemic space, which allows us to speak fairly precisely about mental content. Note that we don’t need to talk about semantics, meaning or references to “objects” here. Mental content is not a “state” either. Thinking “state” and the mental together is much like seriously considering the existence of sea monsters at the end of the 18th century, when the list science of Linnaeus had not yet been reshaped by the upcoming historical turn in the philosophy of nature. Nowadays we must consider it as silly-minded to think about a complex story like the brain and its mind by means of “states”. Doing so, one confounds the stability of the graphical representation of a word in a language with the complexity of a multi-layered dynamic process, spanned between deliberate randomness, self-organized rhythmicity and temporary, thus preliminary, meta-stability.

The notion of mental content does not refer to the representation of referenced “objects”. We do not have maps, lists or libraries in our heads. Everything that we experience as inner life builds up from an enormous randomness through deep stacks of complex emergent processes, where each emergent level is also shaped from the top down, implicitly and, except for the last one usually called “consciousness”, also explicitly. The stability of memory and words, of feelings and faculties is deceptive; they are not so stable at all. Only their externalized symbolic representations are more or less stable, and even their stability as words etc. can be shattered easily. The point we would like to emphasize here is that everything that happens in the mind is constructed on the fly, while the construction is completed only with the ultimate step of externalization, that is, speaking or writing. The notion of “mental content” is thus a bit misleading.

The mental may be conceived most appropriately as a manifold of stacked and intertwined processes. This holds for the naturalist perspective as well as for the abstract perspective, as we have argued in the previous chapter. It is simply impossible to find a single stable point within the (abstract) dynamics between model, concept, mediality and virtuality, which could be thought of as spanning a space. We called it the choreostemic space.

For the following remarks about the relation between texts and machines, and about the practitioners engaged in building machines to handle texts, we have to keep in mind just those two things: (i) there is a primacy of interpretation, (ii) the mental is a non-representational dynamic process that can’t be formalized (in the sense of “being represented” by a formula).

In turn this means that we should avoid referring to formulas when setting out to build a “text machine”. Text machines will be helpful only if their understanding of texts, even if it is a rudimentary understanding, follows the same abstract principles as our human understanding of texts does. Machines pretending to deal with texts, but actually only moving dead formal symbols back and forth, as is the case in statistical text mining, n-gram based methods and the like, are not helpful at all. The only thing that happens is that these machines introduce a formalistic structure into our human life. We may say that these techniques render humans helpful to machines.

Nowadays we can find a whole techno-scientific community engaged in the field of machine learning devoted to “textual data”. The computers are programmed in such a way that they can be used to classify texts. The idea is to provide some keywords, or anti-words, or even a small set of sample texts, which are then taken by the software as a kind of template that is used to build a selection model. This model is then used to select resembling texts from a large set of texts. We have to be very clear about the purpose of these software programs: they classify texts.

The input data for doing so is taken from the texts themselves. More precisely, the texts are preprocessed according to specialized methods. Each text gets described by a possibly large set of “features” that have been extracted by these methods. The obvious point is that the procedure is purely empirical in the strong sense. Only the available observations (the texts) are taken to infer the “similarity” between texts. Usually, not even linguistic properties are used to form the empirical observations, albeit there are exceptions. People use the so-called n-gram approach, which is little more than counting letters. It is a zero-knowledge model about the series of symbols which humans interpret as text. Additionally, the frequency or the relative positions of keywords and anti-words are usually measured and expressed by mostly quite simple statistical methods.
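
To make the poverty of this representation tangible, here is a minimal sketch, in Python, of the kind of zero-knowledge n-gram measurement just described; the example texts are illustrative, and the point is precisely that no understanding is involved:

```python
# Texts reduced to counts of character triplets; "similarity" is just the
# cosine between such count vectors. Nothing here knows what a word means.
from collections import Counter
from math import sqrt

def ngrams(text, n=3):
    """Count overlapping character n-grams of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(ngrams("the chair by the window"), ngrams("a chair near the door")))
```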

Well, classifying texts is something quite different from understanding texts. Of course. Yet, said community tries to reproduce the “classification” achieved or produced by humans. Thus, any engineer in the field of machine learning directed at texts implicitly claims a kind of understanding. They even organize competitions.

The problems with the statistical approach are quite obvious. Quine identified its underlying dogma of empiricism and coined the Gavagai anecdote about it, a situation that provides even much more information than a text alone ever could. In order to understand a text we need references to many things outside the particular text(s) at hand. Two of those are especially salient: concepts and the social dimension. Directly opposite to the belief of positivists, concepts can’t be defined in advance of a particular interpretation. Using catalogs of references does not help much if these catalogs are used just as lists of references. The software does not understand “chair” by the “definition” stored in a database, or even by the set of such references. It simply does not care whether the encoded ASCII codes yield the symbol “chair” or the symbol “h&e%43”. Douglas Hofstadter has been stressing this point over and over again, and we fully agree with it.

From this necessity of a particular and rather wide “background” (Searle’s notion) derives the second problem, which is much more serious, even devastating to the soundness of the whole empirico-statistical approach. The problem is simple: Even we humans have to read a text before being able to understand it. Only upon understanding can we classify it. Of course, the brains of many people are trained sufficiently to work through the relations of a text and any of its components while reading it. The basic setup of the problem, however, remains the same.

Actually, what happens is a constantly repeated re-reading of the text, taking into account all available insights regarding the text and its relations to the author and the reader, while this re-reading often takes place in memory. To perform this demanding task in parallel, based on the “cache” available from memory, requires a lot of experience and training, though. Less experienced people indeed re-read the text physically.

The consequence of all of that is that we cannot determine the best empirical discriminators for a particular text while reading it, in order to select it as if we were applying a model. Actually, we can’t determine the set of discriminators before we have read it all, at least not before the first pass. Let us call this the completeness issue.

The very first insight is thus that a one-shot approach to text classification is based on a misconception. The software and the human would have to align to each other in some kind of conversation. Otherwise it can’t be specified in principle what the task is, that is, which texts should actually be selected. Any approach to text classification not following the “conversation scheme” is necessarily plain nonsense. Yet, that’s not really a surprise (except for some of the engineers).

There is a further consequence of the completeness issue. We can’t set up a table to learn from at all. This too is not a surprise, since setting up a table means to set up a particular symbolization. Any symbolization prior to understanding must count as a hypothesis. As simple as that. Whether it matches our purpose or not, we can’t know before we have understood the text.

However, in order to make the software learn something, we need assignates (traditionally called “properties”) and some criteria to distinguish better models from less performant ones. In other words, we need a recurrent scheme on the technical level as well.

That’s why it is not perfectly correct to call texts “unstructured data”. (Besides the fact that data are not “out there”: we always need a measurement device, which in turn implies some kind of model AND some kind of theory.) In the case of texts, imposing a structure onto a text simply means to understand it. We even could say that a text as text is not structurable at all, since the interpretation of a text can never be regarded as finished.

All together, we may summarize the complexity of texts as deriving from the following properties:

  • – there are different levels of context, which additionally stretch across surrounds of very different sizes;
  • – there are rich organizational constraints, e.g. grammars;
  • – there is a large corpus of words, while any of them bears meaning only upon interpretation;
  • – there is a large number of relations that not only form a network, but which also change dynamically in the course of reading and interpretation;
  • – texts are symbolic: spatial neighborhood does not translate into reference, in either direction;
  • – understanding texts requires a wealth of external and quite abstract concepts, which appear as significant only upon interpretation, as well as a social embedding of mutual interpretation.

This list should suffice to exclude any attempt to defend the empirico-statistical approach as a reasonable one, except for the fact that it conveys a better-than-nothing attitude. This brings us to the question of utility.

Engineers build machines that are supposedly useful; more exactly, they are intended to fulfill a particular purpose. Mostly, however, machines, and indeed any technology in general, are useful only upon processes of subjective appropriation. The most striking example for this is the car. Likewise, computers evolved not for reasons of utility, but rather for gaming. Video did not become popular for artistic or commercial reasons, but due to the possibilities the medium offered to the sex industry. The lesson here is that an intended purpose is difficult to achieve in terms of the actual usage of the technology. On the other hand, every technology may exert some gravitational force to develop a then unintended symbolic purpose, and considerable value regarding it. So, could we agree that the classification of texts as it is performed by contemporary technology is useful?

Not quite. We can’t regard the classification of texts, insofar as it is possible with the empirico-statistical approach at all, as a reasonable technology. For the classification of texts can’t be separated from their understanding. All we can accomplish by this approach is to filter out those texts that do not match our interests with a sufficiently high probability. Yet, for this task we do not need text classification.

Architectures like the 3L-SOM (which we will introduce below) could also be expected to play an important role in translation, as translation requires an even deeper understanding of texts than is needed for sorting texts according to a template.

Besides noting the necessity of this doubly recurrent scheme, we haven’t said much so far about how to actually treat the text. Texts should not be mistaken for empiric data. That means that we have to take a modified stance regarding measurement itself. In several essays we already mentioned the conceptual advantages of the two-layered (TL) approach based on self-organizing maps (TL-SOM). We already described in detail how the TL-SOM works, including the basic preparation of the random graph as it has been described by Kohonen.

The important thing about the TL-SOM is that it is not a device for modeling the similarity of texts. It is just a representation, albeit a very powerful one, because it is based on probabilistic contexts (random graphs). More precisely, it is just one of many possible representations, even if it is much more appropriate than n-grams and other jokes. We should NOT even consider the TL-SOM as so-called “unsupervised modeling”, as the distinction between unsupervised and supervised is just another myth (= nonsense when it comes to quantitative models). The TL-SOM is nothing else than an instance of associative storage.

The trick of using a random graph (see the link above) is that the surrounds of words are differentially represented as well. The Kohonen model is quite spartan in this respect, since it applies a completely neutral model. In fact, words in a text are represented as if they were all the same: of the same kind, of the same weight, etc. That’s clearly not reasonable. Instead, we should represent a word in several, different manners within the same SOM.
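
To give a rough impression of what such a probabilistic context representation looks like, here is a minimal sketch in the spirit of Kohonen’s preparation, assuming random codes for words and simple averaging over observed neighbors; the function name and the dimensionality are illustrative, not the SomFluid implementation:

```python
# Kohonen-style word contexts ("random graph"): each word gets a fixed random
# code; its representation concatenates the averaged codes of its observed
# left neighbors, its own code, and the averaged codes of its right neighbors.
# Such vectors would feed the first layer of the TL-SOM.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # dimensionality of the random word codes (an assumed value)

def context_vectors(tokens):
    codes = {w: rng.standard_normal(DIM) for w in set(tokens)}
    left = {w: np.zeros(DIM) for w in codes}
    right = {w: np.zeros(DIM) for w in codes}
    counts = {w: 0 for w in codes}
    for i, w in enumerate(tokens):
        counts[w] += 1
        if i > 0:
            left[w] += codes[tokens[i - 1]]
        if i < len(tokens) - 1:
            right[w] += codes[tokens[i + 1]]
    return {w: np.concatenate((left[w] / counts[w], codes[w], right[w] / counts[w]))
            for w in codes}

vecs = context_vectors("the cat sat on the mat and the dog sat too".split())
```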

Yet, the random graph approach should not be considered just a “trick”. We repeatedly argued (for instance here) that we have to “dissolve” empirical observations into a probabilistic (re)presentation in order to evade the pseudo-problem of “symbol grounding”. Note that even by the practice of setting up a table in order to organize “data” we are already crossing the Rubicon into the realm of the symbolic!

The real trick of the TL-SOM, however, is something completely different. The first layer represents the random graph of all words; the actual pre-specific sorting of texts, however, is performed by the second layer on the output of the first layer. In other words, the text is “renormalized”, and the SOM itself is used as a measurement device. This renormalization allows us to organize data in a standardized manner while avoiding the symbolic fallacy. To our knowledge, this possible usage of the renormalization principle has not been recognized so far. It is indeed a very important principle that puts many things in order. We will deal with this issue again later in a separate contribution.
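
The step is easy to express as a sketch, assuming some routine `som1_bmu` that returns the best-matching node of the trained first layer (the name is hypothetical): the text is observed as a normalized histogram of hits on the first layer, and only this standardized fingerprint is passed on.

```python
# Renormalization step: a text, given as a bag of word-context vectors, is
# mapped onto a fixed-size histogram of activations of the first SOM layer.
# The second layer then sorts these fingerprints, not the raw texts.
import numpy as np

def text_fingerprint(word_vectors, som1_bmu, n_nodes):
    hist = np.zeros(n_nodes)
    for v in word_vectors:
        hist[som1_bmu(v)] += 1          # count hits per first-layer node
    return hist / max(hist.sum(), 1)    # normalize: texts of any length compare
```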

Only on the basis of the associative storage, taken as an entirety, does appropriate modeling become possible for textual data. The tremendous advantage of that is that the structure for any subsequent consideration now remains constant. We may indeed set up a table. The content of this table, the data, however, is not derived directly from the text. Instead we first apply renormalization (a technique known from quantum physics, cf. [1]).

The input is some description of the text completely in terms of the TL-SOM. More explicitly, we have to “observe” the text as it behaves in the TL-SOM. Here we are indeed legitimized to treat the text as an empirical observation, albeit we can, of course, observe the text in many different ways. Yet, observing means to conceive of the text as a moving target, as a series of multitudes.

One of the available tools is Markov modeling, either as Markov chains or by means of Hidden Markov Models. But there are many others. Most significantly, probabilistic grammars, even probabilistic phrase structure grammars, can be mapped onto Markov models. Yet, again we meet the problem of a priori classification. Both kinds of model, Markovian as well as grammarian, need an assignment of a grammatical type to a phrase, which itself often requires understanding first.

Given the autonomy of texts, their temporal structure and the impossibility of applying an a priori schematism, our proposal is that we just have to conceive of the text as we do of (higher) animals. Like an animal in its habitat, we may think of the text as inhabiting the TL-SOM, our associative storage. We can observe paths, their length and form, preferred neighborhoods, velocities, the size and form of the habitat.
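
Such observations are straightforward to compute once the text is given as its sequence of best-matching nodes on the map. A minimal sketch, with 2D grid coordinates assumed for the nodes:

```python
# Descriptors of a text's "behavior" on the map, derived from the path of
# best-matching nodes. The transition counts are the raw material for a
# simple Markov-chain view of the same path.
import numpy as np
from collections import Counter

def path_descriptors(bmu_path, coords):
    steps = [np.linalg.norm(np.subtract(coords[b], coords[a]))
             for a, b in zip(bmu_path, bmu_path[1:])]
    return {
        "habitat_size": len(set(bmu_path)),           # distinct nodes visited
        "path_length": float(np.sum(steps)),          # total distance travelled
        "mean_velocity": float(np.mean(steps)) if steps else 0.0,
        "transitions": Counter(zip(bmu_path, bmu_path[1:])),
    }
```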

Similar texts will behave in a similar manner. Such similarity is far beyond the statistical approach (better: as if from another planet). We can also see now that the statistical approach is trapped by the representationalist fallacy. This similarity is of course a relative one. The important point here is that we can describe texts in a standardized manner, strictly WITHOUT reducing their content to statistical measures. It is also quite simple to determine the similarity of texts, whether as a whole or regarding any part of them. We need not determine the range of our source at all prior to the results of modeling. That modeling introduces a third logical layer. We may apply standard modeling, using a flexible tool for transformation and a further instance of a SOM, as we provide it with SomFluid in the downloads. The important thing is that this last step of modeling has to run automatically.

The proposed structure keeps any kind of reference completely intact. It also draws on its collected experience, that is, all the texts it has digested before. It is not necessary to determine stopwords and similar gimmicks. Of course, we could, but that’s part of the conversation. Just provide an example of any size, just as it is available. Everything from two words, to a sentence, to a paragraph, to the content of a directory will work.

Such a 3L-SOM is very close to what we reasonably could call “understanding texts”. But does it really “understand”?

As such, not really. First, images should be stored in the same manner (!!), that is, preprocessed as random graphs over local contexts of various size, into the same (networked population of) SOM(s). Second, a language production module would be needed. But once we have those parts working together, then there will be full understanding of texts.

(I take any reasonable offer to implement this within the next 12 months, seriously!)

Conclusion

Understanding is a faculty to move around in a world of symbols. That’s not meant as a trivial issue. First, the world consists of facts, where facts comprise a universe of dynamic relations. Symbols are just not like traffic signs or pictograms, as these belong to the simplest kind of symbols. Symbolizing is a complex, social, mediatized, diachronic process.

Classifying, understood as “performing modeling and applying models”, consists basically of two parts. One of them could be automated completely, while the other one could not be treated by a finite or a priori definable set of rules at all: setting the purpose. In the case of texts, classifying can’t be separated from understanding, because the purpose of the text emerges only upon interpretation, which in turn requires a manifold of modeling raids. Modeling a (quasi-)physical system is completely different from that, it is almost trivial. Yet, the structure of a 3L-SOM could well evolve into an arrangement that is capable of understanding in a similar way as we humans do. More precisely, and a bit more abstractly, we could also say that a “system” based on a population of 3L-SOMs will once be able to navigate in the choreostemic space.

References
  • [1] B. Delamotte (2003). A hint of renormalization. Am. J. Phys. 72 (2004), 170–184. Available online: arXiv:hep-th/0212049v3.

۞

Transformation

May 17, 2012

In the late 1980s there was a funny, or strange, if you like, discussion in the German public about a particular influence of the English language on the German language. That discussion got not only teachers in higher education going; even „Der Spiegel“, Germany’s (still) leading weekly news magazine, damned the respective „anglicism“. What I am talking about here concerns the attitude towards „sense“. At that time, a good 20 years ago, it was considered impossible to say „dies macht Sinn“, engl. „this makes sense“. Speakers of German at that time understood the “make” as “to produce”. Instead, one was told, the correct phrase had to be „dies ergibt Sinn“, in a literal but impossible translation something like „this yields sense“, or even „dies hat Sinn“, in a literal, but again wrong and impossible translation, „this has sense“. These former ways of building a reference to the notion of „sense“ feel awkward to many (most?) speakers of German today. Nowadays, the English version of the meaning of the phrase has replaced the old German one, and one can even find the analogue of “making” sense in the „Spiegel“.

Well, the issue here is not just one of historical linguistics or of style. The differences that we can observe here are deeply buried in the structure of the respective languages. It is hard to say whether such idioms in the German language are due to the history of German Idealism, or whether this particular philosophical stance developed on the basis of the structures in the language. Perhaps a bit of both, one could say from a Wittgensteinian point of view. Anyway, we may relate such differences in “contemporary” language to philosophical positions.

It is certainly by no means an exaggeration to conclude that cultures differ significantly in what their languages allow to be expressible. Such a thing as an “exact” translation is not possible beyond trivial texts or a use of language that is very close to physical action. Philosophically, we may assign a scale, or a measure, to describe the differences mentioned above in probabilistic terms, and this measure spans between pragmatism and idealism. This contrast also deeply influences philosophy itself. Any kind of philosophy comes in those two shades (at least), often expressed or denoted by the attributes „continental“ and „anglo-american“. I think these labels just hide the relevant properties. This contrast of course applies to the reading of idealistic or pragmatic philosophers itself. It really makes a difference (in 1980s German: „it is a difference“) whether a native English-speaking philosopher reads Hegel, or a German native; whether a German native is reading Peirce, or an American; whether Quine conducts research in logic, or Carnap. The story quickly complicates if we take into consideration French philosophy and its relation to Heidegger, or the reading of modern French philosophers in contemporary German-speaking philosophy (which is almost completely absent).1

And it becomes even more complicated, if not complex and chaotic, if we consider the various scientific sub-cultures as particular forms of life, formed by and forming their own languages. In this way it may well seem rather impossible—at least, one feels tempted to think so—to understand Descartes, Leibniz, Aristotle, or even the pre-Socratics, not to speak of the Cro-Magnon culture2, albeit it is probably more appropriate to reframe the concept of understanding. After all, it may itself be infected by idealism.

In the chapters to come you may expect the following sections. As we did before, we’ll try to go beyond the mere technical description, providing the historical trace and the wider conceptual frame:

A Shift of Perspective

Here, I need this reference to the relativity as it is introduced in—or by—language for highlighting a particular issue. The issue concerns a shift in preference, from the atom, the point, from matter, substance, essence and metaphysical independence, towards the relation and its dynamic form, the transformation. This shift concerns some basic relationships of the weave that we call “Lebensform” (form of life), including the attitude towards those empiric issues that we will deal with in a technical manner later in this essay, namely the transformation of “data”. There are, of course, almost countless aspects of the topos of transformation, such as evolutionary theory, the issue of development, or, in the more abstract domains, mathematical category theory. In some way or another we have already dealt with these earlier (for category theory, for evolutionary theory). These aspects of the concept of transformation will not play a role here.

In philosophical terms, the described difference between the German and the English language, and the change of the respective German idiom, marks the transition from idealism to pragmatism. This corresponds to the transition from a philosophy of primal identity to one where difference is transcendental. In the same vein, we could also set up the contrast between logical atomism and the event as philosophical topoi, or between favoring existential approaches and ontology against epistemology. Even more remarkably, we also find an opposing orientation regarding time. While idealism, materialism, positivism or existentialism (and all similar attitudes) are heading backwards in time, and only backwards, pragmatism and, more generally, a philosophy of events and transformation is heading forward, and only forward. It marks the difference between settlement (in Heideggerian terms „Fest-Stellen“, in English something like „fixing at a location“, putting something into the „Gestell“3) and anticipation. Settlements are reflected by laws of nature in which time does not—and shall not—play a significant role. All physical laws, and almost all theories in contemporary physics, are symmetric with respect to time. The “law perspective” blinds against the concept of context, quite obviously so. Yet, being blinded against context also disables any adequate reference to information.

In contrast, within a framework that is truly based on the primacy of interpretation and thus following the anticipatory paradigm, it does not make sense to talk about “laws”. Notably, issues like the “problem” of induction exist only in the framework of the static perspective of idealism and positivism.

It is important to understand that these attitudes are far from being just “academic” distinctions. There are profound effects on the level of empiric activity, concerning how data are handled and by which kinds of methods. Furthermore, they can’t be “mixed” once one of them has been chosen. Although we may switch between them sequentially, across time or across domains, we can’t practice them synchronously, as the whole setup of the form of life is influenced. Of course, we do not want to rate one of them as the “best”; we just want to make clear that there are particular consequences of that basic choice.

Towards the Relational Perspective

As late as 1991, Robert Rosen’s work on „Relational Biology“ was anything but obvious [1]. As a mathematician, Rosen was interested in the problem of finding a proper way to represent living systems by formal means. As a result of this research, he strongly proposed the “relational” perspective. He identifies Nicolas Rashevsky as its originator, who first mentioned it around 1935. It really sounds strange that relational biology had to be (re-)invented. What else than relations could be important in biology? Yet, still today atomistic thinking is quite abundant; just think of the reductionist approaches in genetics (which fortunately have come under serious attack meanwhile4). Or think about the still prevailing helplessness in various domains to conceive appropriately of complexity (see our discussion of this here). Being aware of relations means that the world is not conceived as made from items that are described by inputs and outputs with some analytics, or say deterministics, in between. Only of such items could it be said that they “function”. The relational perspective abolishes the possibility of reducing real “systems” to “functions”.

As already indicated by the appearance of Rashevsky, there is, of course, a historical trace for this shift, a kind of soil emerging from intellectual sediments.5 While the 19th century could be considered as being characterized by the topos of the population (of atoms)—cf. the line from Laplace and Carnot to Darwin and Boltzmann—we can observe a spawning awareness for the relation in the 20th century. Wittgenstein’s Tractatus started to oppose Frege and has always been in stark contrast to logical positivism, accompanied then by Zermelo (“axiom” of choice6), Rashevsky (relational biology), Turing (morphogenesis in complex systems), McLuhan (media theory), String Theory in physics, Foucault (field of propositions), and Deleuze (transcendental difference). Comparing Habermas and Luhmann on the one side—we may label their position idealistic functionalism—with Sellars and Brandom on the other—who have been digging into the pragmatics of the relation as it is present in humans and their culture—we find the same kind of difference. We could also include Gestalt psychology as a kind of precursor to the party of “relationalists”, mathematical category theory (as opposed to set theory) and some strains from the behavioral sciences. Researchers like Ekman & Scherer (FACS), Kummer (sociality expressed as dynamics of relative positions), or Colmenares (play) focused on the relation itself, going far beyond the implicit reference to the relation as a secondary quality. We may add David Shane7 for architecture and Clarke or Latour8 for sociology. Of course, there are many, many other proponents who helped to grow the topos of the relation; yet, even without a detailed study we may guess that, compared to the main streams, they still remain comparatively few.

These differences can hardly be overestimated in the fields of information science, computer science, data analysis, or machine-based learning and episteme. It makes a great difference whether one bases the design of an architecture, or the design of use, on the concept of interfaces, most often defined as a location of full control, notably in both directions, or on the concept of behavioral surfaces.9 In the field of empiric activities, that is, modeling in its wide sense, it yields very different setups and consequences whether we start with the assumption of independence between our observables, or between our observations, or whether we start with no assumptions about the dependency between observables or observations, respectively. The latter is clearly the preferable choice in terms of intellectual soundness. Even if we stick to the first of the two alternatives, we should NOT use methods that work only if that assumption is satisfied. (It is some kind of a mystery that people believe that doing so could be called science.) The reason is pretty simple. We do not know anything about the dependency structures in the data before we have finished modeling. It would inevitably result in a petitio principii if we put “independence” into the analysis, wrapped into the properties of methods. We would just find… guess what. After destroying facts—in the Wittgensteinian sense understood as relationalities—into empiristic dust, we will not be able to find any meaningful relation at all.

Positioning Transformation (again)

Similarly, if we treat data as a “true” mapping of an outside “reality”, as “givens” that are possibly distorted a bit by more or less noise, we will never find multiplicity in the representations that we could derive from modeling, simply because it would contradict the prejudice. We also would not recognize all the possible roles of transformation in modeling. Measurement devices act as filters10, and as such they do not differ from any analytic transformation of the data. From the perspective of the associative part of modeling, where the data are mapped to desired outcomes or decisions, “raw” data are simply not distinguishable from “transformed” data, unless the treatment itself were encoded as data as well. Correspondingly, we may consider any data transformation by algorithmic means as an additional measurement device, responding to particular qualities in the observations in its own right. It is this equivalence that allows for the change from the linear to a circular and even a self-referential arrangement of empiric activities. Long-term adaptation, I would say any adaptation at all, is based on such a circular arrangement. The only thing we had to change to earn the new possibilities was to drop the “passivist” representationalist realism11.

Usually, the transformation of data is considered as an issue that is a function of discernibility as an abstract property of data (yet, people don’t talk like that; it’s our way of speaking here). Today, the aphorism coined by Bateson has already become proverbial, despite its simplistic shape: information is a difference that makes a difference. According to the context in which data are handled, this potential discernibility is addressed in different ways. Let us distinguish three such contexts: (i) data warehousing, (ii) statistics, and (iii) learning as an epistemic activity.

In data warehousing one is usually faced with a large range of different data sources and data sinks, or consumers, where the difference between these sources and sinks simply relates to the different technologies and formats of databases. The warehousing tool should “transform” the data such that they can be used in the intended manner on the side of the sinks. The storage of the raw data as measured from the business processes, and the efforts to provide any view onto these data, have to satisfy two conditions (in the current paradigm). It has to be neutral—data should not be altered beyond the correction of obvious errors—and its performance, simply in terms of speed, has to be scalable, if not independent of the data load. The activities in data warehousing are often circumscribed as “Extract, Transform, Load”, abbreviated ETL. There are many large software solutions for this task, commercial ones and open source (e.g. Talend). The effect of DWH is to disclose the potential for an arbitrary and quickly served perspective onto the data, where “perspective” means just re-arranged columns and records from the database. Except for cleaning and simple arithmetic operations, the individual bits of data themselves remain largely unchanged.

In statistics, transformations are applied in order to satisfy the conditions for particular methods. In other words, the data are changed in order to enhance discernibility. Most popular is the log-transformation, which shifts the mode of a distribution towards the larger values. Two different small values that consequently lie close together are separated better after a log-transformation; hence it is feasible to apply the log-transformation to data that form a skewed distribution crowded at small values. Other transformations aim at a particular distribution, such as the z-score, or Fisher’s z-transformation. Interestingly, there is a further class of powerful transformations that is usually not conceived as such. Residuals are defined as the deviation of the data from a particular model; in linear regression it is the vertical distance to the regression line.
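
As a small illustration of the two simplest of these transformations, here is a sketch on synthetic, strictly positive data (the lognormal sample is just an assumed example):

```python
# The log-transform spreads nearby small values of a skewed sample apart,
# while the z-score standardizes data to zero mean and unit variance.
import numpy as np

x = np.random.default_rng(1).lognormal(mean=0.0, sigma=1.0, size=1000)

x_log = np.log(x)                  # mode shifts, small values get separated
z = (x - x.mean()) / x.std()       # z-score: mean 0, standard deviation 1

# two nearby small values: distance 0.01 before, about 0.69 after the log
print(abs(0.02 - 0.01), abs(np.log(0.02) - np.log(0.01)))
```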

The concept, however, can be extended to those data which do not “follow” the investigated model. The analysis of residuals has two aspects, a formal one and an informal one. Formally, it is used as a complex test of whether the investigated model fits or not. The residuals should not show any evident “structure”. That’s it. There is no institutional way back to the level of the investigated model; there are no rules about that which could be negotiated in a yet-to-be-established community. The statistical framework is a linear one, which could be seen as a heritage of positivism. It is explicitly forbidden to “optimize” a correlation by multiple actualization. Yet, informally the residuals may give hints on how to change the basic idea as represented by the model. Here we find a circular setup, where the strategy is to remove any rule-based regularity, i.e. discernibility, from the data.

The effect of this circular arrangement takes place completely in the practicing human, as a kind of refinement. It can’t be found anywhere in the methodological procedure itself in a rule-based form. This brings us to the third area, epistemic learning.

In epistemic learning, any of the potentially significant signals should be rendered in such a way as to allow for an optimized mapping towards a registered outcome. Such outcomes often come as binary values, or as a small group of ordinal values in the case of multi-constraint, multi-target optimization. In epistemic learning we thus find the separation of transformation and association in its most prominent form, despite the fact that data warehousing and statistics are also intended to be used for enhancing decisions. Yet, their linearity simply does not allow for any kind of institutionalized learning.

This arbitrary restriction to the linear methodological approach in formal epistemic activities results in two related, quite unfavorable effects: first, the shamanism of “data exploration”, and second, the infamous hell of methods. One can indeed find thousands, if not tens of thousands, of research or engineering articles trying to justify a particular new method as the most appropriate one for a particular purpose. These methods themselves, however, are never identified as “transformations”. Authors are all struggling for the “best” method, the whole community neglecting the possibility—and the potential—of combining different methods after shaping them as transformations.

The laborious and never-ending training necessary to choose from the huge number of possible methods is then called methodology… The situation is almost paradoxical. First, the methods are claimed to tell something about the world, although this is not possible at all, not least because those methods are analytic. It is an idealistic hope, one that was already demolished by Hume. Above all, only analytic methods are considered to be scientific. Then, through the large population of methods, the choice of a particular one becomes aleatory, which renders the whole activity into a deeply non-scientific one. Additionally, it is governed by the features of some software, or the skills of the user of such software, not by a conceptual stance.

Now remember that any method is also a specific filter. Obviously, nothing can be known about the benefit of a particular method before the prediction that is based on the respective model has been validated. This simple insight renders “data exploration” meaningless. It can only play its role within linear empirical frameworks, which are inappropriate anyway. Data exploration is suggested to be done “intuitively”, often using methods of visualization. Yet, those methods are severely restricted with regard to the graspable dimensionality. More than 6 to 8 dimensions can’t be “visualized” at once. Compare this to the 2^n (n = number of variables) possible models and you immediately see the problem. Besides, the only effect of visualization is just a primitive form of clustering. Additionally, visual inputs are images above all, and as images they can’t play a well-defined epistemological role.12

Complementary to the non-concept of “exploring” data13, and equally misconceived, is the notion of “preparing” data. At least, it must be rated as misconceived as far as it comprises transformations beyond error correction and arranging data into tables. The reason is the same: we can’t know whether a particular “cleansing” will enhance the predictive power of the model, in other words, whether it comprises potential information that supports the intended discernibility, before the model has been built. There is no possibility to decide which variables to include before having finished the modeling. In some contexts the information accessible through a particular variable could be relevant or even important. Yet, if we conceive of transformations as preliminary hypotheses, we can’t call them “preparation” any more. “Preparation” for what? For proving the petitio principii? Certainly the peak of all preparatory nonsense is the “imputation” of missing values.

Dorian Pyle [11] calls such introduced variables “pseudo variables”, others call them “latent” or even “hidden variables”.14 Any of these labels is inappropriate, since the transformation is nothing else than a measurement device. Introduced variables are just variables, nothing else.

Indeed, these labels are reliable markers: whenever you meet a book or article dealing with data exploration, data preparation, the “problem” of selecting a method, or likewise of selecting an architecture within a meta-method like Artificial Neural Networks, you can know for sure that the author is not really interested in learning and reliable predictions. (Or that he or she is not able to distinguish analysis from construction.)

In epistemic learning the handling of residuals is somewhat inverse to their treatment in statistics, again as a result of the conceptual difference between the linear and the circular approach. In statistics one tries to prove that the model, say: transformation, removes all the structure from the data such that the remaining variation is pure white noise. Unfortunately, there are two drawbacks with this. First, one has to define the model before removing the noise and before checking the predictive power. Secondly, the test for any possibly remaining structure again takes place within the atomistic framework.

In learning we are interested in the opposite. We are looking for such transformations as remove the noise in a multi-variate manner such that the signal-to-noise ratio is strongly enhanced, perhaps even to the proto-symbolic level. Only after the de-noising due to the learning process, that is, after a successful validation of the predictive model, is the structure then described for the (almost) noise-free data segment15, as an expression that is complementary to the predictive model.

In our opinion an appropriate approach would actualize as an instance of epistemic learning that is characterized by

  • – conceiving any method as transformation;
  • – conceiving measurement as an instance of transformation;
  • – conceiving any kind of transformation as a hypothesis about the “space of expressibility” (see next section), or, similarly, the finally selected model;
  • – the separation of transformation and association;
  • – the circular arrangement of transformation and association.

The Abstract Perspective

We now have to take a brief look at the mechanics of transformations in the domain of epistemic activities.16 For doing this, we need a proper perspective. As such we choose the notion of space. Yet, we would like to emphasize that this space is not necessarily Euclidean, i.e. flat, or open like the Cartesian space, i.e. with quantities running to infinity. Also, dimensions need not be thought of as being “independent”, i.e. orthogonal to each other. Distance measures need to be defined only locally, yet without implying ideal continuity. There might be a certain kind of “graininess”, defined by a distance D, below which the space is not defined. The space may even contain “bubbles” of lower dimensionality. So, it is indeed a very general notion of “space”.

Observations shall be represented as “points” in this space. Since these “points” are not independent of the efforts of the observer, they are not dimensionless. To put it more precisely, they are like small “clouds” that are best described as probability densities for “finding” a particular observation. Of course, this “finding” is an inextricable mixture of “finding” and “constructing”. It does not make much sense to distinguish both on the level of such cloudy points. Note that the cloudiness is not a problem of accuracy in measurement! A posteriori, that is, subsequent to introducing an irreversible move17, such a cloud could also be interpreted as an open set of the provoked observation and virtual observations. It should be clear by now that such a concept of space is very different from the Euclidean space that nowadays serves as the base concept for any statistics or data mining. If you think that conceiving of such a space is unnecessary or even nonsense, then think about quantum physics. In quantum physics we are also faced with the breakdown of the distinction between observer and observable, and it ended up quite precisely with spaces as we described them above. These spaces are then handled by various renormalization methods.18 In contrast to the abstract yet still physical space of quantum theory, our space need not even contain an “origin”. Elsewhere we called such a space an aspectional space.

Now let us take the important step of becoming interested in only a subset of these observations. Assume we want to select a very particular set of observations—they are still clouds of probabilities, made from virtual observations—by means of prediction. This selection can be conceived in two different ways. The first way is the one that is commonly applied and consists of the reconstruction of a “path”. Since in the contemporary epistemic life form of “data analysts” Cartesian spaces are used almost exclusively, all these selection paths start from the origin of the coordinate system. The endpoint of the path is the point of interest, the “outcome” that should be predicted. As a result, one first gets a mapping function from predictor variables to the outcome variable. All possible mappings form the space of mappings, which is a category in the mathematical sense.

The alternative view does not construct such a path within a fixed coordinate system, i.e. within a space with fixed properties. Quite the contrary: the space itself gets warped and transformed until very simple figures appear, which represent the various subsets of observations according to the focused quality.

Imagine an ordinary, small, blown-up balloon. Next, imagine a grid in the space enclosed by the balloon’s hull, made by very thin threads. These threads shall represent the space itself. Of course, in our example the space is 3d, but it is not limited to this case. Now think of two kinds of small pearls attached to the threads all over the grid inside the balloon, blue ones and red ones. It shall be the red ones in which we are interested. The question now is what can we do to separate the blue ones from the red ones?

The way to proceed is pretty obvious, though the solution itself may be difficult to achieve. What we can try is to warp and to twist, to stretch, to wring and to fold the balloon in such a way that the blue pearls and the red pearls separate as nicely as possible. In order to purify the groups we may even consider compressing some regions of the space inside the balloon such that they turn into singularities. After all this work—and beware, it is hard work!—we introduce a new grid of threads into the distorted space and dissolve the old ones. All pearls automatically attach to the threads closest nearby, stabilizing the new space. Again, conceiving of such a space may seem weird, but again we can find a close relative in physics, the Einsteinian space-time. Gravitation effectively warps that space, though in a continuous manner. There are famous empirical proofs of that warping of physical space-time.19
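
A toy version of the balloon picture can be written down in a few lines: two classes arranged as concentric rings are not separable by any single straight cut in the original coordinates, but recoding each point by its radius (a crude warp of the space) makes one threshold suffice. The numbers are, of course, merely illustrative:

```python
# Two "pearl" classes on concentric rings: inseparable by one linear cut in
# (x, y), separable by a single threshold after warping to the radius.
import numpy as np

rng = np.random.default_rng(2)
angles = rng.uniform(0, 2 * np.pi, 200)
blue = np.c_[1.0 * np.cos(angles[:100]), 1.0 * np.sin(angles[:100])]  # inner ring
red = np.c_[3.0 * np.cos(angles[100:]), 3.0 * np.sin(angles[100:])]   # outer ring

warp = lambda pts: np.linalg.norm(pts, axis=1)   # the warped coordinate
print(warp(blue).max() < 2.0 < warp(red).min())  # True: one cut now separates
```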

Analytically, these two perspectives, path reconstruction on the one hand and space warping on the other, are (almost) equivalent. The perspective of space warping, however, offers a benefit that is not to be underestimated. We arrive at a new space for which we can define its own properties, and in which we again can define measures that are different from those possible in the original space. Path reconstruction does not offer such a “derived space”. Hence, once the path is reconstructed, the story stops. It is a linear story. Our proposal thus is to change perspective.

Warping the space of measurability and expressibility is an operation that inverts the generation of cusp catastrophes20 (see Figure 1 below). Thus it transcends the cusp catastrophes. In the perspective of path reconstruction one has to avoid the phenomena of hysteresis and cusps altogether, hence losing a lot of information about the observed source of data.

In Cartesian space, and in the path reconstruction methodology related to it, all operations are analytic, that is, organized as symbolic rewriting. The reason for this is the necessity for the paths to remain continuous and closed. In contrast, space warping can be applied locally. Warping spaces when dealing with data is not an exotic or rare activity at all. It happens all the time. We know it even from (simple) mathematics, when we define different functions, including the empty function, for different sets of input parameter values.

The main consequence of changing the perspective from path reconstruction to space warping is an enlargement of the set of possible expressions. We can do more without the need to call it “heuristics”. Our guess is that any serious theory of data and measurement must follow the opened route of space warping, if this theory of data tries to avoid positivistic reductionism. Most likely, such a theory will be kind of a renormalization theory in a connected, relativistic data space.

Revitalizing Punch Cards and Stacks

In this section we will introduce the outline of a tool that allows one to follow the circular approach in epistemic activities. Basically, this tool is about organizing arbitrary transformations. While for analytic (mathematical) expressions there are expression interpreters, it is also clear that analytic expressions form only a subset of the set of all possible transformations, even if we consider the fact that many expression interpreters have grown into some kind of programming language, or script language. Indeed, Java contains an interpreting engine for JavaScript by default, and there are several quite popular ones for mathematical purposes. One could also conceive of mathematical packages like Octave (open source), MatLab or Mathematica (both commercial) as such expression interpreters, even if their most recent versions can do much, much more. Yet, MatLab & Co. are not quite suitable as a platform for general-purpose data transformation.

The structural metaphor that proved to be as powerful as it was sustainable for more than 10 years now is the combination of the workbench with the punch card stack.

Image 1: A Punched Card for feeding data into a computer

Any particular method, mathematical expression or arbitrary computational procedure resulting in a transformation of the original data is conceived as a “punch card”. This provides a proper modularization, and hence standardization. Actually, the role of these “functional compartments” is extremely standardized, at least enough to define an interface for plugins. Like the ancient punch cards made from paper, each card represents a more or less fixed functionality. Of course, this functionality may be defined by a plugin that itself connects to MatLab…

Also, again like the ancient punch cards, the virtualized versions can be stacked. For instance, we first put the treatment for missing values onto the stack, simply to ensure that all NULLs are written as -1. The next card then determines the minimum and maximum in order to provide the data for linear normalization, i.e. the mapping of all values into the interval [0..1]. Then we add a card for compressing the “fat tail” of the distribution of values in a particular variable. Alternatively, we may use a card to split the “fat tail” off into a new variable! Finally we apply the card (= plugin) for normalizing the data to the original and the new data column.
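
The mechanics of such a stack can be sketched in a few lines; the concrete cards below just mirror the example above, and none of this is the SomFluid API; it only illustrates the metaphor:

```python
# A "card" is a callable transforming a data column; a stack is an ordered
# list of cards applied in sequence.
import numpy as np

def card_missing(col):                 # write all NULLs (NaN) as -1
    return np.where(np.isnan(col), -1.0, col)

def card_compress_tail(col, q=0.95):   # compress the fat tail above quantile q
    cap = np.quantile(col, q)
    return np.minimum(col, cap)

def card_linear_norm(col):             # map all values into [0..1]
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo) if hi > lo else np.zeros_like(col)

def apply_stack(col, stack):
    for card in stack:
        col = card(col)
    return col

data = np.array([0.2, 1.5, np.nan, 40.0, 2.2])
out = apply_stack(data, [card_missing, card_compress_tail, card_linear_norm])
```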

I think you get the idea. Such a stack is not only maintained for each of the variables, it is created on the fly according to the needs, as these are detected by simple rules. You may think of the cards also as the sets of rules that describe the capabilities of agents which constantly check the data as to whether they could apply their rules. You may also think of these stacks as a device that works like a tailored distillation column, as it is used for fractional distillation in petro-chemistry.

Image 2: Some industrial fractional distillation columns for processing mineral oil. Dependent on the number of distillation steps different products result.

These stacks of parameterized procedures and expressions represent a generally programmable computer, or more precisely an operating system, quite similar to a spreadsheet, albeit the purpose of the latter, and hence its functionality, actualizes in a different form. The whole thing may even be realized as a language! In that case, one would not need a graphical user interface anymore.

The effect of organizing the transformation of data in this way, by means of plugins that follow the metaphor of the “punch card stack”, is dramatic. Introducing transformations and testing them can be automated. At this point we should mention the natural ally of the transformation workbench: the maximum likelihood estimation of the most promising transformations that combine just two or three variables into a new one. All three parts, the transformation stack engine, the dependency explorer, and the evolutionarily optimized associative engine (which is able to create a preference weighting for the variables) can be put together in such a way that finding the “optimal” model can be run in a fully automated manner. (Meanwhile the SomFluid package has grown into a stage where it can accomplish this… download it here, but you still need some technical expertise to get it running.)

The approach of the “transformation stack engine” is not just applicable to tabular data, of course. Given a set of proper plugins, it can be used as a digester for large sets of images or time series as well (see below).

Transforming Data

In this section we will now take a more practical and pragmatic perspective. Actually, we will describe some of the most useful transformations, including their parameters. We do so because even prominent books about “data mining” have been handling the issue of transforming data in a mistaken or at least seriously misleading manner.21,22

If we consider the goal of the transformation of numerical data, namely to increase the discernibility of the observations as described by their assignates, we will recognize that we may identify a rather limited number of types of such transformations, even if we consider the space of possible analytic functions that combine two (or three) variables.

We will organize the discussion of the transformations into three sub-sections, whose subjects are of increasing complexity. Hence, we will start with the (ordinary) table of data.

Tabular Data

Tables may comprise numerical data or strings of characters. In its general form a table may even contain whole texts, a complete book in any of the cells of a column (but see the section about unstructured data below!). If we want to access the information carried by the string data, we sooner or later have to translate them into numbers. Unlike numbers, string data, and the relations between data points made from string data, must be interpreted. As a consequence, there are always several, if not many, different possibilities for that representation. Besides referring to the actual semantics of the strings, which could be expressed by means of the indices of some preference orders, there are also two important techniques of automatic scaling available, which we will describe below.

Besides string data, dates are a further multi-dimensional category of data. A date encodes not only a serial number relative to some (almost) arbitrarily chosen base date, which we can use to express the age of the item represented by the observation. We also have, of course, day of week, day of month, number of week, number of month, and not to forget the season as an approximate class. It depends a bit on the domain whether these aspects play any role at all. Yet, if you think about the rhythms in the city or on the stock markets across the week, or the “black Monday/Tuesday/Friday effect” in production plants or hospitals, then it is clear that we usually have to represent the single date value by several “informational extracts”.

A last class of data types that we have to distinguish are time values. We already mentioned the periodicity in other aspects of the calendar. In which pair of time values do we find the closer similarity, T1(23:41, 0:05) or T2(8:58, 15:17)? Under any naive distance measure the values of T2 are evaluated as much more similar than those of T1. What we have to do is to set a flag for “circularity” in order to calculate the time distances correctly.
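A minimal sketch of this flag, assuming times are converted to minutes on a 24-hour circle; the function names are illustrative only.

```python
def minutes(hh_mm):
    """Converts an 'hh:mm' string into minutes since midnight."""
    h, m = map(int, hh_mm.split(":"))
    return h * 60 + m

def time_distance(t1, t2, circular=True):
    """Distance in minutes; with the circularity flag, midnight is no barrier."""
    d = abs(minutes(t1) - minutes(t2))
    return min(d, 1440 - d) if circular else d

print(time_distance("23:41", "0:05"))   # 24 minutes: very similar
print(time_distance("8:58", "15:17"))   # 379 minutes: rather dissimilar
```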

Numerical Data: Numbers, just Numbers?

Numerical data are data for which, in principle, any value from within a particular interval could be observed. If such data are symmetrically and normally distributed, then we have little reason to suspect that there is something interesting within this sample of values. As soon as the distribution becomes asymmetrical, it starts to become interesting. We may observe “fat tails” (large values are over-represented), or multi-modal distributions. In both cases we could suspect that there are at least two different processes, one dominating the other differentially across the peaks. So we should split the variable into two (called “deciling”) and, ceteris paribus, check the effect on the predictive power of the model. Typically one splits the values at the minimum between the peaks, but it is also possible to implement an overlap, where some records are present in both of the new variables.
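The following sketch illustrates such a split, assuming that a crude histogram is sufficient to locate the valley between the peaks and that the higher peak lies to the left of it; real detection rules would be more careful.

```python
def histogram(values, bins=20):
    lo, hi = min(values), max(values)
    w = ((hi - lo) / bins) or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / w), bins - 1)] += 1
    return counts, lo, w

def split_at_valley(values, bins=20, overlap=0.0):
    """Splits a (bimodal) variable at the least filled bin behind the peak;
    with overlap > 0 some records are present in both new variables."""
    counts, lo, w = histogram(values, bins)
    peak = counts.index(max(counts))
    valley = min(range(peak + 1, bins), key=counts.__getitem__)
    cut = lo + valley * w
    left  = [v for v in values if v <= cut + overlap]
    right = [v for v in values if v >  cut - overlap]
    return left, right, cut
```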

Long tails indicate some aberrant behavior of the items represented by the respective records, or, as in medicine, even pathological contexts. Strongly left-skewed distributions often indicate organizational or institutional influences. Here we could compress the long tail, log-shift, and then split the variable, that is, decile it into two.21

In some domains, like finance, we find special values at which symmetry breaks. For ordinary money values, 0 is such a value. We know in advance that we have to split the variable into two, because the semantic and the structural difference between +50$ and -75$ is much bigger than that between 150$ and 2500$… probably. As always, we transform the variable such that we create additional variables as a kind of hypothesis, for which we have to evaluate the (positive) contribution to the predictive power of the model.
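A sketch of such a symmetry-break split; the breakpoint at 0 and the encoding of the empty branch are chosen merely for illustration.

```python
def split_at(values, breakpoint=0.0):
    """One money column becomes two new variables, i.e. two hypotheses."""
    pos = [v if v >= breakpoint else None for v in values]
    neg = [v if v <  breakpoint else None for v in values]
    return pos, neg

credits, debits = split_at([150.0, -75.0, 50.0, 2500.0])
```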

In finance, but also in medicine, and more generally in any system that is able to develop meta-stable regions, we have to expect such points (or regions) with increased probability of breaking symmetry and hence strong semantic or structural difference. René Thom first described similar phenomena by the theory that he labeled “catastrophe theory”.20 In 3D you can easily think of the cusp catastrophe as a hysteresis in the x-z direction that is, however, gradually smoothed out in the y-direction.

Figure 1: Visualization of folds in parameters space, leading to catastrophes and hystereses.

In finance we are faced with a whole culture of rule following. The majority of market analysts use the same tools, for instance “stochastics” or a particularly parameterized MACD, for deriving “signals”, that is, indicators for points of action. The financial industries have been hiring a lot of physicists, and this population sticks to largely the same mathematics, such as GARCH, combined with Monte-Carlo simulations. Approaches like fractal geometry are still regarded as exotic.23

Or think about option prices, where we find several symmetry breaks by virtue of the contract. These points have to be represented adequately in dedicated, that is, derived variables. Again, we can’t emphasize it enough, we HAVE to do so as a kind of performative hypothesizing. The transformation of data by creating new variables is, so to speak, the low-level operationalization of what later may grow into a scientific hypothesis. Creating new variables poses serious problems for most methods, which may count as a reason why many people don’t follow this approach. Yet, for our approach it is not a problem, definitely not.

In medicine we often find “norm values”. Potassium in blood serum may take any value within a particular range without reflecting any physiologic problem… if the person is healthy. If there are other risk factors, the story may be a different one. The ratio of potassium and glucose in serum provides an example for a significant marker… if the person already has heart problems. By means of such risk markers we can introduce domain-specific knowledge. And that’s actually a good message, since we can identify our own “markers” and represent them as transformations. The consequence is pretty clear: a system that is supposed to “learn” needs a suitable repository for storing and handling such markers, represented as a relational system (graph).

Let us briefly return to the norm ranges. A small difference outside the norm range could be rated much more strongly than one within the norm range. This may lead to the weight functions shown in the next figure, or more or less similar ones. For a certain range of input values, the norm range, we leave the values unchanged: the output weight equals 1. Outside of this range we transform them in a way that emphasizes the difference to the respective boundary value of the norm range. This could be done in different ways.

Figure 2: Examples for output weight configurations in norm-range transformation

Actually, this rationale of the norm range can be applied to any numerical data. As an estimate of the norm range one could use the central 80% quantile range, centered around the median and realized as the ±40% quantiles. On the level of model selection, this will result in a particular sensitivity for multi-dimensional outliers, notably without defining apriori any criterion of what an outlier should be.
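A sketch of such a norm-range weighting, assuming the central 80% quantile range as the norm range and a simple linear emphasis outside of it; the emphasis function is only one of the different possible ways just mentioned.

```python
def quantile(sorted_vals, q):
    return sorted_vals[int(q * (len(sorted_vals) - 1))]

def norm_range_transform(values, emphasis=2.0):
    """Leaves values inside the norm range unchanged (weight 1) and
    emphasizes the difference to the boundary outside of it."""
    s = sorted(values)
    lo, hi = quantile(s, 0.1), quantile(s, 0.9)   # median +/- 40% quantiles
    out = []
    for v in values:
        if lo <= v <= hi:
            out.append(v)
        elif v > hi:
            out.append(hi + emphasis * (v - hi))
        else:
            out.append(lo - emphasis * (lo - v))
    return out
```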

From Strings to Orders to Numbers

Many data come as some kind of description or label. Such data are described as nominal data. Think for instance about the prescribed drugs in a group of patients included in an investigation of risk factors for a disease, or think about the names or the types of restaurants in an urbanological/urbanistic investigation. Nominal data are quite frequent in behavioral, organizational or social data, that is, in contexts that are established mainly on a symbolic level.

Performing measurements only on the nominal scale should be avoided; yet, sometimes it is not possible to circumvent it. It could be avoided at least partially by including further properties that can be represented by numerical values. For instance, instead of using only the names of cities in a data set, one can use the geographical location or the number of inhabitants, or, when referring to places within a city, one can use descriptors that cover some properties of the respective area, such items as density of traffic, distance to similar locations, price level of consumer goods, economical structure etc. If a direct measurement is not possible, estimates can do the job as well, if the certainty of the estimate is expressed. The certainty then can be used to generate surrogate data. If the fine-grained measurement creates further nominal variables, they could be combined to form a scale. Such enrichment is almost always possible, irrespective of the domain. One should keep in mind, however, that any such enrichment is nothing else than a hypothesis.

Sometimes, data on the nominal level, technically a string of alphanumerical characters, already contain valuable information. For instance, they may contain numerical values, as in the names of cars. If we deal with things like the names of molecules, where these names often come as compounds, reflecting the fact that molecules themselves are compounds, we can calculate the distance of each name to a virtual “average name” by applying a technique called “random graph”. Of course, in the case of molecules we would have a lot of properties available that can be expressed as numerical values.

Ordinal data are closely related to nominal data. Essentially, there are two flavors of them. In the least valuable case the numbers do not express a numerical value; the cipher is just used as a kind of letter, indicating that there is a set of sortable items. Sometimes, the values of an ordinal scale represent some kind of similarity. Although this variant is more valuable, it can still be misleading, because the similarity may not scale isodistantly with the numerical values of the ciphers. Undeniably, there is still a rest of a “name” in it.

We are now going to describe some transformations to deal with data from low-level scales.

The least action we have to apply to nominal data is a basic form of encoding: we use integer values instead of the names. The next, though only slightly better, level would be to reflect the frequency of the encoded item in the ordinal value. One would, for instance, not encode the name into an arbitrary integer value, but into the log of the frequency. A much better alternative, however, is provided by the descendants of correspondence analysis. These are called Optimal Scaling and the Relative Risk Weight. The drawback of these methods is that some information about the predicted variable is necessary. In the context of modeling, by which we always understand target-oriented modeling—as opposed to associative storage24—we usually find such information, so the drawback is not too severe.

First to optimal scaling (OSC). Imagine a variable, or “assignate” as we prefer to call it25, which is scaled on the nominal or the low ordinal scale. Let us assume that there are just three different names or values. As already mentioned, we assume that a purpose has been selected and hence a target variable as its operationalization is available. Then we could set up the following table (the figures denote frequencies).

Table 1: Summary table derived from a hypothetical example data set. av(i) denote three nominally scaled assignates.

outcome (tv)  | av1 | av2 | av3 | marginal sum
ta            | 140 | 120 | 160 | 420
tf (focused)  |  30 |  10 |  40 |  80
marginal sum  | 170 | 130 | 200 | 500

From these figures we can calculate the new scale values as the share of the focused outcome tf within the column marginal sum of the respective assignate:

OSC(av(i)) = f(tf, av(i)) / ( f(ta, av(i)) + f(tf, av(i)) )

For the assignate av1 this yields OSC(av1) = 30/170 ≈ 0.176.

Table 2: Here, various encodings are contrasted.

assignate | literal encoding | frequency | normalized log(freq) | optimal scaling | normalized OSC
av1       | 1                | 170       | 0.62                 | 0.176           | 0.809
av2       | 2                | 130       | 0.0                  | 0.077           | 0.0
av3       | 3                | 200       | 1.0                  | 0.200           | 1.0

Using these values we could replace any occurrence of the original nominal (ordinal) values by the scaled values. Alternatively—or better, additionally—we could sum up all values for each observation (record), thereby collapsing the nominally scaled assignates into a single numerically scaled one.
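The calculation can be reproduced in a few lines; the layout of the frequency table is of course only an illustrative assumption.

```python
freqs = {"av1": {"ta": 140, "tf": 30},
         "av2": {"ta": 120, "tf": 10},
         "av3": {"ta": 160, "tf": 40}}

def optimal_scaling(freqs):
    """Focused share of the column marginal, plus its min-max normalization."""
    osc = {a: f["tf"] / (f["ta"] + f["tf"]) for a, f in freqs.items()}
    lo, hi = min(osc.values()), max(osc.values())
    return osc, {a: (v - lo) / (hi - lo) for a, v in osc.items()}

raw, normalized = optimal_scaling(freqs)
# raw        -> av1: 0.176, av2: 0.077, av3: 0.200  (cf. Table 2)
# normalized -> av1: 0.809, av2: 0.0,   av3: 1.0
```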

Now we will describe the RRW. Imagine a set of observations {o(i)}, where each observation is described by a set of assignates a(i). Also let us assume that some of these assignates are on the binary level, that is, the presence of this quality in the observation is encoded by “1”, its absence by “0”. This usually results in sparsely filled (regions of) the data table. Depending on the size of the “alphabet”, even more than 99.9% of all values could simply be equal to 0. Such data cannot be grouped in a reasonable manner. Additionally, if there are further assignates in the table that are not binary encoded, the information in the binary variables would be neglected almost completely without applying a rescaling like the RRW.

The raw RRW relates the share of an assignate within the focused outcome tf to its share within the remaining outcome ta, using the row marginals:

raw RRW(av(i)) = ( f(tf, av(i)) / Σ f(tf) ) / ( f(ta, av(i)) / Σ f(ta) )

For the assignate av1 this yields raw RRW(av1) = (30/80) / (140/420) ≈ 1.13.

As you can see, the RRW uses the marginals from the rows, while optimal scaling uses the marginals from the columns. Thus, the RRW uses slightly more information. Assuming a table made from binary assignates av(i), which could be summarized into Table 1 above, the formula yields the following RRW factors for the three binary scaled assignates:

Table 3: Relative Risk Weights (RRW) for the frequency data shown in table 1.

assignate | raw RRWi | RRWi | normalized RRW
av1       | 1.13     | 0.33 | 0.82
av2       | 0.44     | 0.16 | 0.00
av3       | 1.31     | 0.36 | 1.00

The ranking of the av(i) based on the RRW is equal to that returned by the OSC; even the normalized score values are quite similar. Yet, while in the case of nominal variables the assignates are usually not collapsed, this is always done in the case of binary variables.
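For comparison, a sketch of the raw RRW as reconstructed from the figures of Tables 1 and 3; the intermediate column RRWi rests on an additional rescaling that is not specified here, so only the raw factors are computed.

```python
freqs = {"av1": {"ta": 140, "tf": 30},
         "av2": {"ta": 120, "tf": 10},
         "av3": {"ta": 160, "tf": 40}}

def raw_rrw(freqs):
    """Share within the focused outcome tf relative to the share within ta,
    using the row marginals (80 and 420 in Table 1)."""
    n_tf = sum(f["tf"] for f in freqs.values())
    n_ta = sum(f["ta"] for f in freqs.values())
    return {a: (f["tf"] / n_tf) / (f["ta"] / n_ta) for a, f in freqs.items()}

print(raw_rrw(freqs))   # av1: 1.13, av2: 0.44, av3: 1.31, as in Table 3
```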

So, let us summarize these simple methods in the following table.

Table 4: Overview of some of the most important transformations for tabular data.

Transformation                 | Mechanism                                                                       | Effect, New Value                                                                                      | Properties, Conditions
log-transform                  | analytic function                                                               |                                                                                                        |
analytic combination           | explicit analytic function (a,b) → f(a,b)                                       | enhancing signal-to-noise ratio for the relationship between predictors and predicted, 1 new variable  | targeted modeling
empiric combinational recoding | using simple clustering methods like KNN or K-means for a small number of assignates | distance from cluster centers and/or cluster centers as new variables                            | targeted modeling
deciling                       | upon evaluation of properties of the distribution                               | 2 new variables                                                                                        |
collapsing                     | based on extreme-value quantiles                                                | 1 new variable, better distinction for data in frequent bins                                           |
optimal scaling                | numerical encoding and/or rescaling using marginal sums                         | enhancing the scaling of the assignate from nominal to numerical                                       | targeted modeling
relative risk weight           | dto.                                                                            | collapsing sets of sparsely filled variables                                                           | targeted modeling

Obviously, the transformation of data is not an analytical act, on neither side. On the left hand it refers to structural and hence semantic assumptions, while on the right hand it introduces hypotheses about those assumptions. Numbers are never ever just values, much like sentences and words do not consist just of letters. After all, the difference between the two is probably less than one could initially presume. Later we will address this aspect from the opposite direction, when it comes to the translation of textual entities into numbers.

Time Series and Contexts

Time series data are the most valuable data. They allow the reconstruction of the flow of information in the observed system, either between variables intrinsic to the measurement setup (reflecting the “system”) or between treatment and effects. In recent years, so-called “causal FFT” has gained some popularity.

Yet, modeling time series data poses the same problematics as tabular data. We do not know apriori which variables to include, or how to transform variables in order to reflect particular parts of the information in the most suitable way. Simply pressing an FFT onto the data is nothing but naive. The FFT assumes a harmonic oscillation, or a combination thereof, which certainly is not appropriate. Even if we interpret a long series of FFT terms as an approximation to an unknown function, it is by no means clear whether the then assumed stationarity26 is indeed present in the data.

Instead, it is more appropriate to represent the aspects of a time series in multiple ways. Often, there are many time series available, one for each assignate. This brings the additional problem of a careful evaluation of cross-correlations and auto-correlations, and all of this under the condition that it is not known apriori whether the evolution of the system is stationary.

Fortunately, the analysis of multiple time series, even from non-stationary processes, is quite simple if we follow the approach as outlined so far. Let us assume a set of assignates {a(i)} for which we have their time series measurements available, given by equidistant measurement points. A transformation then is constructed by a method m that is applied to a moving window of size md(k). All moving windows of any size are adjusted such that their endpoints meet at the measurement point at time t(m(k)). Let us call this point the prediction base point, T(p). The transformed values consist either of the residuals between the method’s values and the measurement data, or of the parameters of the method fitted to the moving window. An example for the latter case is given by the wavelet coefficients, which provide a quite suitable, multi-frequency perspective onto the development up to T(p). Of course, the time series data of different assignates could be related to each other by any arbitrary functional mapping.
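A sketch of such a context generation, with the mean and the slope of a simple regression as stand-ins for the more elaborate methods (wavelets, correlational measures) mentioned above; the window sizes are arbitrary examples.

```python
def slope(window):
    """Slope of a simple linear regression over an equidistant window."""
    n = len(window)
    xm, ym = (n - 1) / 2.0, sum(window) / n
    num = sum((i - xm) * (y - ym) for i, y in enumerate(window))
    den = sum((i - xm) ** 2 for i in range(n))
    return num / den

def contexts(series, window_sizes=(5, 20, 60)):
    """One synthetic record per prediction base point T(p); all windows are
    adjusted such that their endpoints meet at T(p)."""
    records = []
    for t in range(max(window_sizes) - 1, len(series)):
        rec = {"T(p)": t, "value": series[t]}
        for w in window_sizes:
            win = series[t - w + 1: t + 1]
            rec["mean_%d" % w] = sum(win) / w
            rec["slope_%d" % w] = slope(win)
        records.append(rec)
    return records
```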

The target value for the model could be any set of future points relative to t(m(k)). The model may predict a singular point, an average some time in the future, the volatility of the future development of the time series, or even the parameters of a particular mapping function relating several assignates. In the latter case the model would predict several criteria at once.

Such transformations yield a table that contains a lot more variables than originally available. The ratio may grow up to 1:100 in complex cases like the global financial markets. Just to be clear: if you measure, say, the index values of 5 stock markets, some commodities like gold, copper, precious metals and “electronics metals”, the money market, bonds and some fundamentals alike, that is approx. 30 basic input variables, even a superficial analysis would have to inspect 3000 variables… Yes, learning and gaining experience can take quite a bit! Learning and experience do not become cheaper merely because we use machines to achieve them. Exploring has just become easier nowadays, no longer requiring lifetimes. The reward consists of stable models about complex issues.

Each point in time is reflected by the original observational values and a lot of variables that express the most recent history relative to the point in time represented by the respective record. Any of the synthetic records thus may be interpreted as a set of hypotheses about the future development, where each hypothesis comes as a multidimensional description of the context up to T(p). It is then the task of the evolutionarily optimized variable selection based on the SOM to select the most appropriate hypothesis. Any subgroup contained in the SOM then represents comparable sets of relations between the past relative to T(p) and the respective future as it is operationalized into the target variable.

Typical transformations in such associative time series modeling are

  • moving average and exponentially decaying moving average for de-seasoning or de-trending;
  • various correlational methods: cross- and auto-correlation, including the result parameters of the Bartlett test;
  • Wavelet-, FFT-, or Walsh-transforms of different order, residuals to the denoised reconstruction;
  • fractal coefficients like the Lyapunov coefficient or the Hausdorff dimension;
  • ratios of simple regressions calculated over moving windows of different size;
  • domain-specific markers (think of technical stock market analysis, or the ECG).

Once we have expressed a collection of time series as series of contexts preceding the prediction point T(p), the further modeling procedure does not differ from the modeling of ordinary tabular data, where the observations are independent from each other. From the perspective of our transformation tool, these time series transformations are nothing else than “methods”; they do not differ from other plugin methods with respect to the procedure calls in their programming interface.

“Unstructurable” “Data”: Images and Texts

The last type of data for which we briefly would like to discuss the issue of transformation is “unstructurable” data. Images and texts are the main representatives for this class of entities. Why are these data “unstructurable”?

Let us answer this question from the perspective of textual analysis. Here the reason is obvious; actually, there are several obvious reasons. Patrizia Violi [17], for instance, emphasizes that words create their own context, upon which they are then going to be interpreted. Douglas Hofstadter extended the problematics to thinking at large, arguing that for any instance of analogical thinking—and he claimed any thinking to be analogical—it is impossible to define criteria that would allow one to set up a table. Here on this site we have argued repeatedly that it is not possible to define apriori any criteria that would capture the “meaning” of a text.

Moreover, understanding language, as well as understanding texts, can’t be mapped to the problematics of predicting a time series. In language, there is no such thing as a prediction point T(p), and there is no positively definable “target” which could be predicted. The main reason for this is the special dynamics between context (background) and proposition (figure). It is a multi-level, multi-scale thing. It is ridiculous to apply n-grams to text, then hoping to catch anything “meaningful”. The same is true for any statistical measure.

Nevertheless, using language, that is, producing and understanding it, is based on processes that select and compose. In some way there must be some kind of modeling. We already proposed a structure, or more, an architecture, for this in a previous essay.

The basic trick consists of two moves: Firstly, texts are represented probabilistically as random contexts in an associative storage like the SOM. No variable selection takes place here, no modeling, and no operationalization of a purpose is present. Secondly, this representation then is used as a basis for targeted modeling. Yet, the “content” of this representation does not consist of “language” data anymore. Strikingly different, it contains data about the relative location of language concepts and their sequence as they occur as random contexts in a text.

The basic task in understanding language is to accomplish the progress from a probabilistic representation to a symbolic tabular representation. Note that any tabular representation of an observation is already on the symbolic level. In the case of language understanding precisely this is not possible: we can’t define meaning, and above all not apriori. Meaning appears as a consequence of performance, of the execution of certain rules to a certain degree. Hence we can’t provide apriori the symbols that would be necessary to set up a table for modeling, assessing “similarity” etc.

Now, instead of “probabilistic non-structured representation” we could also say “arbitrarily unstable structure”. From this we should derive a structured, (proto-)symbolic and hence tabular and almost stable structure. The trick to accomplish this consists in using the modeling system itself as a measurement device and thus also as a “root” for further reference in the then possible models. Kohonen and colleagues demonstrated this crucial step in their WebSom project. Unfortunately (for them), they then actualized several misunderstandings regarding modeling. For instance, they misinterpreted associative storage as a kind of model.

The nice thing with this architecture is that once the symbolic level has been achieved, any of the steps of our modeling approach can be applied without any change, including the automated transformation of “data” as described above.

Understanding the meaning of images follows the same scheme. The fact that there are no words renders the task more complicated and more simple at the same time. Note that so far there is no system that would have learned to “see”, to recognize and to understand images, although many titles claim that the proposed “system” can do so. All computer vision approaches are analytic by nature, hence they are all deeply inadequate. The community is running straight into the method hell as the statisticians and the data miners did before, mistaking transformations for methods, conflating transformation and modeling, etc. We discussed these issues at length above. Any of the approaches might be intelligently designed, but all are victimized by the representationalist fallacy, and probably even by naive realism. Due to the fact that the analytic approach is first, second and third mainstream, the probabilistic and contextual bottom-up approach is missing so far. In the same way as a word is not equal to the grapheme, a line is not defined on the symbolic level in the brain. Here, once again, we meet the problem of analogical thinking even on the most primitive graphical level. When is a line still a line, when is a triangle still a triangle?

In order to start in the right way we first have to represent the physical properties of the image along different dimensions, such as textures, edges, or salient points, and all of those across different scales. Probably one can even detect salient objects by some analytic procedure. From any of the derived representations the random contexts are derived and arranged as vectors. A single image is represented as a table that contains random contexts derived from the image as a physical entity. From here on, the further processing scheme is the same as for texts. Note that there is no such property as “line” in this basic mapping.

In the case of texts and images the basic transformation step thus consists in creating the representation as random contexts. Fortunately, this is “only” a question of the suitable plugins for our transformation tool. In both cases, for texts as well as images, the resulting vectors could grow considerably. Several thousands of implied variables must be expected. Again, there is already a solution, known as random projection, which allows one to compress even very large vectors (say 20’000+) into one of, say, at most 150 variables, without losing much of the information that is needed to retain the distinctive potential. Random projection works by multiplying a vector of size N with a matrix of uniformly distributed random values of size N×M, which results in a vector of size M. Of course, M is chosen suitably (100+). The reason why this works is that with that many dimensions all vectors are approximately orthogonal to each other! Of course, the resulting fields in such a vector do not “represent” anything that could be conceived as a reference to an “object”. Internally, however, that is, from the perspective of a (population of) SOMs, it may well be used as an (almost) fixed “attribute”. Yet, neither the missing direct reference nor the subjectivity poses a problem, as the meaning is not a mental entity anyway. Q.E.D.
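A sketch of random projection in plain Python; the sizes are scaled down from the 20’000 → 150 example to keep the demonstration cheap, and the uniform distribution on [-1, 1] is just one possible choice.

```python
import random

def projection_matrix(n, m, seed=42):
    """A random N x M matrix; fixed seed, so all records share one mapping."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(n)]

def project(vector, matrix):
    """Multiplies a vector of size N with the N x M matrix, yielding size M."""
    m = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(m)]

R = projection_matrix(n=2000, m=150)
compressed = project([0.0] * 1999 + [1.0], R)   # 2000 fields -> 150 fields
```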

Conclusion

Here in this essay we discussed several aspects related to the transformation of data as an epistemic activity. We emphasized that an appropriate attitude towards the transformation of data requires a shift in perspective and the adoption of another vantage point. One of the more significant changes in attitude concerns, perhaps, the dropping of any positivist approach as one of the main pillars of traditional modeling. Remember that statistics is such a positivist approach. In our perspective, statistical methods are just transformations, nothing less, but above all also nothing more, characterized by a specific set of rather strong assumptions and conditions for their applicability.

We also provided some important practical examples for the transformation of data, whether tabular data derived from independent observations, time series data, or “unstructurable” “data” like texts and images. Following the proposed approach, we also described a prototypical architecture for a transformation tool that could be used universally. In particular, it allows a complete automation of the modeling task, as it could be used for instance in the field of so-called data mining. The possibility of automated modeling is, of course, a fundamental requirement for any machine-based episteme.

Notes

1. The only reason why we do not refer to cultures and philosophies outside Europe is that we do not know sufficient details about them. Yet, I am pretty sure that taking into account Chinese or Indian philosophy would render the situation even more severe.

2. It was Friedrich Schleiermacher who first observed that even the text becomes alien and at least partially autonomous to its author due to the necessity and inevitability of interpretation. Thereby he founded hermeneutics.

3. In the German language these words all exhibit multiple meanings.

4. In the last 10 years (roughly) it became clear that the gene-centered paradigms are not only not sufficient [2], they are even seriously defective. Evelyn Fox Keller draws a detailed trace of this weird paradigm [3].

5. Michel Foucault [4]

6. The “axiom of choice” is one of the founding axioms in mathematics. Its importance can’t be overestimated. Basically, it assumes that “something is choosable”. The notion of “something choosable” then is used to construct countability as a derived domain. This implies three consequences. First, it avoids assuming countability, that is, the effect of a preceding symbolification, as a basis for set theory. Secondly, it puts performance in first place. These two implications render the “axiom of choice” into a doubly-articulated rule, offering two docking sites, one for mathematics, and one for philosophy. In some way, it thus cannot count as an “axiom”. Those implications are, for instance, fully compatible with Wittgenstein’s philosophy. For these reasons, Zermelo’s “axiom” may even serve as a shared point (of departure) for a theory of machine-based episteme. Finally, the third implication is that through the performance of the selection the relation, notably a somewhat empty relation, is conceived as a predecessor of countability and the symbolic level. Interestingly, this also relates to Quantum Darwinism and String Theory.

7. David Grahame Shane’s theory on cities and urban entities [5] is probably the only theory in urbanism that is truly a relational theory.  Additionally, his work is full of relational techniques and concepts, such as the “heterotopy” (a term coined by Foucault).

8. Bruno Latour developed the Actor-Network-Theory [6,7], while Clarke evolved “Grounded Theory” into the concept of “Situational Analysis” [8]. Latour, as well as Clarke, emphasize and focus the relation as a significant entity.

9. behavioral coating, and behavioral surfaces;

10. See Information & Causality about the relation between measurement, information and causality.

11. “Passivist” refers to the inadequate form of realism according to which things exist as-such, independently from interpretation. Of course, interpretation does not affect the material dimension of a thing. Yet, it changes its relations, insofar as the relations of a thing, the Wittgensteinian “facts”, are visible and effective only if we actively assign significance to them. The “passivist” stance conceives itself as a re-construction instead of a construction (cf. Searle [9]).

12. In [10] we developed an image theory in the context of the discussion about the mediality of facades of buildings.

13. nonsense of “non-supervised clustering”

14. In his otherwise quite readable book [11], though it may serve only as an introduction.

15. This can be accomplished by using a data segment for which the implied risk equals 0 (positive predictive value = 1). We described this issue in the preceding chapter.

16. hint to particle physics…

17. See our previous essay about the complementarity of the concepts of causality and information.

18. For an introduction of renormalization (in physics) see [12], and a bit more technical [13]

19. see the Wiki entry about so-called gravitational lenses.

20. Catastrophe theory is a concept invented and developed by French mathematician Rene Thom as a field of Differential Topology. cf. [14]

21. In their book, Witten & Frank [15] recognize the importance of transformation and include a dedicated chapter about it. They also explicitly mention the creation of synthetic variables. Yet, they also explicitly retreat from it as a practical means for the reason of computational complexity (= here, the time needed to perform a calculation in relation to the amount of data). After all, their attitude towards transformation is somewhat like that towards an unavoidable evil. They do not recognize its full potential. Moreover, as a cure for the selection problem they propose SVMs and their hyperplanes, which is definitely a poor recommendation.

22. Dorian Pyle [11]

23. see Benoit Mandelbrot [16].

24. By using almost meaningless labels, target-oriented modeling is often called supervised modeling, as opposed to “non-supervised modeling”, where no target variable is being used. Yet, such “modeling” does not yield a model, since the pragmatics of the concept of “model” invariably requires a purpose.

25. About assignates: often called property, or feature… see about modeling

26. Stationarity is a concept in empirical system analysis or description which denotes the expectation that the internal setup of the observed process will not change across time within the observed period. If a process is rated as “stationary” upon a dedicated test, one could select one particular method or model to reflect the data. Of course, we again meet the chicken-egg problem: we can decide about stationarity only by means of a completed model, that is, after the analysis. As a consequence, we should not use linear methods, or methods that depend on independence, for checking the stationarity before applying the “actual” method. Such a procedure cannot count as a methodology at all. The modeling approach should be stable against non-stationarity. Yet, the problem of the reliability of the available data sample remains, of course. As a means to “robustify” the resulting model against the unknown future one can apply surrogating. Ultimately, however, the only cure is a circular, or recurrent, methodology that incorporates learning and adaptation as a structure, not as a result.

References
  • [1] Robert Rosen, Life Itself: A Comprehensive Inquiry into the Nature, Origin, and Fabrication of Life. Columbia University Press, New York 1991.
  • [2] Nature Insight: Epigenetics, Supplement Vol. 447 (2007), No. 7143 pp 396-440.
  • [3] Evelyn Fox Keller, The Century of the Gene. Harvard University Press, Boston 2002. See also: E. Fox Keller, “Is There an Organism in This Text?”, in P. R. Sloan (ed.), Controlling Our Destinies. Historical, Philosophical, Ethical, and Theological Perspectives on the Human Genome Project, Notre Dame (Indiana), University of Notre Dame Press, 2000, pp. 288-289.
  • [4] Michel Foucault, The Archaeology of Knowledge. 1969.
  • [5] David Grahame Shane. Recombinant Urbanism: Conceptual Modeling in Architecture, Urban Design and City Theory
  • [6] Bruno Latour. Reassembling The Social. Oxford University Press, Oxford 2005.
  • [7] Bruno Latour (1996). On Actor-network Theory. A few Clarifications. in: Soziale Welt 47, Heft 4, p.369-382.
  • [8] Adele E. Clarke, Situational Analysis: Grounded Theory after the Postmodern Turn. Sage, Thousand Oaks, CA 2005.
  • [9] John R. Searle, The Construction of Social Reality. Free Press, New York 1995.
  • [10] Klaus Wassermann & Vera Bühlmann, Streaming Spaces – A short expedition into the space of media-active façades. in: Christoph Kronhagel (ed.), Mediatecture, Springer, Wien 2010. pp.334-345. available here
  • [11] Dorian Pyle, Data Preparation for Data Mining. Morgan Kaufmann, San Francisco 1999.
  • [12] John Baez (2009). Renormalization Made Easy. Webpage
  • [13] Bertrand Delamotte (2004). A hint of renormalization. Am.J.Phys. 72: 170-184. available online.
  • [14] Tim Poston & Ian Stewart, Catastrophe Theory and Its Applications. Dover Publ. 1997.
  • [15] Ian H. Witten & Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). Elsevier, Oxford 2005.
  • [16] Benoit Mandelbrot & Richard L. Hudson, The (Mis)behavior of Markets. Basic Books, New York 2004.
  • [17] Patrizia Violi (2000). Prototypicality, typicality, and context. in: Liliana Albertazzi (ed.), Meaning and Cognition – A multidisciplinary approach. Benjamins Publ., Amsterdam 2000. p.103-122.

۞

Prolegomena to a Morphology of Experience

May 2, 2012 § Leave a comment

Experience is a fundamental experience.

The very fact of this sentence demonstrates that experience differs from perception, much like phenomena are different from objects. It also demonstrates that there can’t be an analytic treatment or even solution of the question of experience. Experience is not only related to sensual impressions, but also to affects, activity, attention1 and associations. Above all, experience is deeply linked to the impossibility of knowing anything for sure or, likewise, apriori. This insight is etymologically woven into the word itself: in Greek, “peira” means “trial, attempt, experience”, influencing also the roots of “experiment” or “peril”.

In this essay we will focus on some technical aspects that underlie the capability to experience. Before we go in medias res, I have to make clear the rationale for doing so, since, quite obviously, experience cannot be reduced to those said technical aspects, to which for instance modeling belongs. Experience is more than the techné of sorting things out [1], and even more than the techné of the genesis of discernability, but at the same time it plays a particular, if not foundational, role in and for the epistemic process, its choreostemic embedding and their social practices.

Epistemic Modeling

As usual, we take the primacy of interpretation as one of the transcendental conditions, that is, a condition we can’t go beyond, even on the “purely” material level. As a suitable operationalization of this principle, still a quite abstract one and hence calling for situative instantiation, we chose the abstract model. In the epistemic practice, modeling does not, indeed never could, refer to data that is supposed to “reflect” an external reality. If we perform modeling as a pure technique we are just modeling; yet creating a model for whatsoever purpose, so to speak “modeling as such”, or purposed modeling, is not sufficient to establish an epistemic act, which would include the choice of the purpose and the choice of the risk attitude. Such a reduction is typical for functionalism, or for positions that claim a principal computability of epistemic autonomy, as for instance the computational theory of mind does.

Quite in contrast, purposed modeling in epistemic individuals already presupposes the transition from probabilistic impressions to propositional, or say, at least symbolic representation. Without performing this transition from potential signals, that is, mediated “raw” physical fluctuations in the density of probabilities, to the symbolic, it is impossible to create a structure, be it for instance a feature vector as a set of variably assigned properties, “assignates”, as we called them previously. Such a minimal structure, however, is mandatory for purposed modeling. Any (re)presentation of observations to a modeling method thus is already subsequent to prior interpretational steps.

Our abstract model that serves as an operationalization of the transcendental principle of the primacy of interpretation thus must also provide, or comprise, the transition from differences into proto-symbols. Proto-symbols are not just intensions or classes; they are, so to speak, non-empiric classes that have been derived from empiric ones by means of idealization. Proto-symbols are developed into symbols by means of the combination of naming and an associated practice, i.e. a repeated or reproducible performance, or, still in other words, by rule-following. Only on the level of symbols may we then establish a logic, or claim absolute identity. Here we also meet the reason for the fact that in any real-world context a “pure” logic is not possible, as there are always semantic parts serving as a foundation of its application. Speaking about “truth-values” or “truth-functions” is, at the least, meaningless. Clearly, identity as a logical form is a secondary quality and thus quite irrelevant for the booting of the capability of experience. Such extended modeling is, of course, not just a single instance; it is itself a multi-leveled thing. It even starts with those properties of the material arrangement known as the body that also allow for an informational perspective. The most prominent candidate principle of such a structure is the probabilistic, associative network.

Epistemic modeling thus consists of at least two abstract layers: first, the associative storage of random contexts (see also the chapter “Context” for their generalization), where no purpose is implied onto the materially pre-processed signals, and second, the purposed modeling. I am deeply convinced that such a structure is the only way to evade the fallacy of representationalism2. A working actualization of this abstract bi-layer structure may comprise many layers and modules.

Yet, once one accepts the primacy of interpretation, and there is little to say against it, if anything at all, then we are led directly to epistemic modeling as a mandatory constituent of any interpretive relationship to the world, for primitive operations as well as for the rather complex mental life we experience as humans, with regard to our relationships to the environment as well as with regard to our inner reality. Wittgenstein emphasized in his critical solipsism that the conception of reality as inner reality is the only reasonable one [3]. Epistemic modeling is the only way to keep meaningful contact with the external surrounds.

The Bridge

In its technical parts experience is based on an actualization of epistemic modeling. Later we will investigate the role and the usage of these technical parts in detail. Yet, the gap between modeling, even if conceived as abstract, epistemic modeling, and experience is so large that we first have to shed some light on the bridge between these concepts. There are some other issues with experience beyond the mere technical issues of modeling, and they are no less relevant for the technical issues, too.

Experience comprises both more active and more passive aspects, both with regard to performance and to structure. Neither dichotomy must be taken as ideally separated categories, of course. Nor is the basic distinction into active and passive parts a new one. Kant distinguished receptivity and spontaneity as two complementary faculties that combine in order to bring about what we call cognition. Leibniz, in contrast, emphasized the necessity of activity even in basic perception; nowadays, his view has been greatly confirmed by the research on sensing in organic (animals) as well as in in-organic systems (robots). Obviously, the relation between activity and passivity is not a simple one, as soon as we are going to leave the bright spheres of language.3

In the structural perspective, experience unfolds in a given space that we could call the space of experiencibility4. That space is spanned, shaped and structured by open and dynamic collections of any kind of theory, model, concept or symbol, as well as by the mediality that is “embedding” those. Yet, experience also shapes this space itself. The situation reminds one a bit of relativistic space in physics, or the social space in humans, where the embedding of one space into another one will affect both participants, the embedded as well as the embedding space. These aspects we should keep in mind for our investigation of questions about the mechanisms that contribute to experience and to the experience of experience. As you can see, we again refute any kind of ontological stance, even to the smallest degree.5

Now, when going to ask about experience and its genesis, there are two characteristics of experience that force us to avoid the direct path. First, there is the deep linkage of experience to language. We must get rid of language for our investigation in order to avoid the experience of finding just language behind the language, or behind what we upfront call “experience”; yet, we also should not forget about language either. Second, there is the self-referentiality of the concept of experience, which actually renders it into a strongly singular term. Once there are even only tiny traces of the capability for experience, the whole game changes, burying the initial roots and mechanisms that are necessary for the booting of the capability.

Thus, our first move consists in a reduction and linearization, which we have to catch up with later again, of course. We will achieve that by setting everything into motion, so to speak. The linearized question thus is heading towards the underlying mechanisms6:

How do we come to believe that there are facts in the world?7

What are—now viewed from the outside of language8—the abstract conditions and the practiced moves necessary and sufficient for the actualization of such statements?

Usually, the answer will refer to some kind of modeling. Modeling provides the possibility for the transition from the extensional epistemic level of particulars to the intensional epistemic level of classes, functions or categories. Yet, modeling does not provide sufficient reason for experience. Sure, modeling is necessary for it, but it is more closely related to perception, though not being equivalent to it either. Experience as a kind of cognition thus can’t be conceived as a kind of “high-level perception”, quite contrary to the suggestion of Douglas Hofstadter [4]. Instead, we may conceive experience, in a first step, as the result of, and the activity around, the handling of the conditions of modeling.

Even in his earliest writings, Wittgenstein prominently emphasized that it is meaningless to conceive of the world as consisting of “objects”. The Tractatus starts with the proposition:

The world is everything that is the case.

Cases, in the Tractatus, are states of affairs that could be made explicit into a particular (logical) form by means of language. From this perspective one could derive the radical conclusion that without language there is no experience at all. Although we won’t agree with such a thesis, language is a major factor contributing to some often unrecognized puzzles regarding experience. Let us very briefly return to the issue of language.

Language establishes its own space of experiencibility, basically through its unlimited expressibility that induces hermeneutic relationships. Probably mainly due to this particular experiential sphere, language blurs or even blocks the clear sight onto the basic aspects of experience. Language can make us believe that there are phenomena as some kind of original stuff, existing “independently” out there, that is, outside of human cognition.9 Yet, there is no such thing as a phenomenon or even an object that would “be” before experience, and for us humans not even before or outside of language. It is not even reasonable to speak about phenomena or objects as if they would exist before experience. De facto, it is almost non-sensical to do so.

Both objects as specified entities and phenomena at large are consequences of interpretation, in turn deeply shaped by cultural imprinting, and thus heavily depending on language. Refuting that consequence would mean to refute the primacy of interpretation, which would fall into one of the categories of either naive realism or mysticism. Phenomenology as an ontological philosophical discipline is nothing but a misunderstanding (as ontology is henceforth); since phenomenology without ontological parts must turn into some kind of Wittgensteinian philosophy of language, it simply vanishes. Indeed, when already teaching in Cambridge, Wittgenstein once told a friend to report his position to the visiting Schlick, whom he refused to meet on this occasion, as “You could say of my work that it is phenomenology.” [5] Yet, what Wittgenstein called “phenomenology” is completely situated inside language and its practicing, and although there might be a weak Kantian echo in his work, he never supported Husserl’s position of synthetic universals apriori. There is even some likelihood that Wittgenstein, strongly feeling to be constantly misunderstood by the members of the Vienna Circle, put this forward in order to annoy Schlick (a bit), at least to pay him back in kind.

Quite in contrast, in a Wittgensteinian perspective facts are a sort of collectively compressed beliefs about relations. If everybody believes in a certain model of whatever reference and of almost arbitrary expectability, then there is a fact. This does not mean, however, that we get drowned by relativism. There are still the constraints implied by the (unmeasured and unmeasurable) utility of anticipation, both in its individual and its collective flavor. On the other hand, yes, this indeed means that the (social) future is not determined.

More accurately, there is at least one fact, since the primacy of interpretation generates at least the collectivity as a further fact. Since facts take place in language, they do not just “consist” of content (please excuse such awful wording); there is also a pragmatics, and hence there are also at least two different grammars, etc. etc.

How do we, then, individually construct concepts that we share as facts? Even if we need the mediation of a collective, a large deal of the associative work takes place in our minds. Facts are identifiable, thus distinguishable and enumerable. Facts are almost digitized entities; they are constructed from percepts through a process of intensionalization or even idealization, and they sit on the verge of the realm of symbols.

Facts are facts because they are considered as being valid, be it among a collective of people, across some period of time, or for a range of material conditions. This way they turn into a kind of apriori from the perspective of the individual, and there is only that perspective. Here we find the locus situs of several related misunderstandings, such as direct realism, Husserlean phenomenology, positivism, the thing as such, and so on. The fact is even synthetic, either by means of “individual”10 mental processes or by the working of a “collective reasoning”. But, of course, it is by no means universal, as Kant concluded on the basis of Newtonian science, or even as Schlick did in 1930 [6]. There is neither a universal real fact, nor a particular one. It does not make sense to conceive of the world as consisting of independent objects.

As a consequence, when speaking about facts we usually studiously avoid the fact of risk. Participants in the “fact game” implicitly agree on the abandonment of negotiating affairs of risk. Despite the fact that empiric knowledge never can be considered as being “safe” or “secured”, during the fact game we always behave as if it could. Doing so is the more or less hidden work of language, which removes the risk (associated with predictive modeling) and replaces it by metaphorical expressibility. Interestingly, here we also meet the source field of logic. It is obvious (see Waves & Words) that language is neither an extension of logic, nor is it reasonable to consider it as a vehicle for logic, i.e. for predicates. Quite to the contrary, the underlying hypothesis is that (practicing) language and (weaving) metaphors are the same thing.11 Such a language becomes a living language that (as Gier writes [5])

“[…] grows up as a natural extension of primitive behavior, and we can count on it most of the time, not for the univocal meanings that philosophers demand, but for ordinary certainty and communication.”

One might just modify Gier’s statement a bit by specifying “philosophers” as idealistic, materialistic or analytic philosophers.

In “On Certainty” (OC, §359), Wittgenstein speaks of language as expressing primitive behavior and contends that ordinary certainty is “something animal”. This we may now take as a bridge that provides the possibility to extend our asking about concepts and facts towards the investigation of the role of models.

Related to this there is a pragmatist aspect worth mentioning. Experience is a historicizing concept, much like knowledge. Both concepts are meaningful only in hindsight. As soon as we consider their application, we see that both of them refer only to one half of the story about the epistemic aspects of “life”. The other half of the epistemic story, directly implied by the inevitable need to anticipate, is predictive or, equivalently, diagnostic modeling. Abstract modeling in turn implies theory, interpretation and orthoregulated rule-following.

Epistemology thus should not be limited to “knowledge”, the knowable and its conditions. Epistemology has to explicitly include the investigation of the conditions of what can be anticipated.

In a still different way we thus may repose the question about experience as the transition from epistemic abstract modeling to the conditions of that modeling. This includes the instantiation of practicable models as well as the conditions for that instantiation, and also the conditions of the application of models. In technical terms this transition is represented by a problematic field: the model selection problem, or in more pragmatic terms, the model (selection) risk.

These two issues, the prediction task and the condition of modeling, now form the second toehold of our bridge between the general concept of experience and some technical aspects of the use of models. There is another bridge necessary to establish the possibility of experience, and this one connects the concept of experience with languagability.

The following list provides an overview of the chapters ahead:

• The Modeling Statement
• Predictability and Predictivity
• The Independence Assumption
• The Model Selection Problem
• Optimization
• Noise, and Noise
• Describing Classifiers
• Utilization of Information
• Observations and Probabilities
• The Result of Modeling

These topics are closely related to each other, indeed so closely that other sequences would be justifiable too. Their interdependencies also demand a bit of patience from you, the reader, as the picture will be complete only when we arrive at the results of modeling.

A last remark may be allowed before we start to delve into these topics. It should be clear by now that any kind of phenomenology is deeply incompatible with the view developed here. There are several related stances, e.g. the various shades of ontology, including the objectivist conception of substance. They are all rendered as irrelevant and inappropriate for any theory about episteme, whether in its machine-based form or regarding human culture, whether as practice or as reflecting exercise.

The Modeling Statement

As the very first step we have to clearly state the goal of modeling. From the outside that goal is pretty clear: given a set of observations and the respective outcomes, or targets, create a mapping function such that the observed data allow for a reconstruction of the outcome in an optimized manner. Finding such a function can be considered as a simple form of learning if the function is "invented". In most cases it is not learning but just the estimation of pre-defined parameters.12 In a more general manner we could also say that any learning algorithm is a map L from data sets to a ranked list of hypothesis functions. Note that accuracy is only one of the possible aspects of that optimization. Let us call this for convenience the "outer goal" of modeling. Were such a mapping perfect within reasonable boundaries, we would have automatically found a possible transition from probabilistic presentation to propositional representation. We could consider the induction of a structural description from observations as completed. So far the secret dream of Hans Reichenbach, Carl Hempel, Wesley Salmon and many of their colleagues.

The said mapping function will never be perfect. The reasons for this comprise the complexity of the subject, noise in the measured data, unsuitable observables, or any combination of these. This induces a wealth of necessary steps and, of course, a lot of work. In other words, a considerable amount of apriori and heuristic choices has to be made. Since a reliable, say analytic, mapping can't be found, every single step in the value chain towards the model at once becomes questionable and has to be checked for its suitability and reliability. It is also clear that the model does not comprise just a formula. In real-world situations a differential modeling should be performed, much like in medicine a diagnosis is considered to be complete only if a differential diagnosis is included. This comprises the investigation of the influence of the method's parameterization on the results. Let us call the whole bunch of respective goals the "inner goals" of modeling.

So, being faced with the challenge of such empirical mess, what does the statement about the goals of the "inner modeling" look like? We could for instance demand to remove the effects of the shortfalls mentioned above, which cause the imperfect mapping: complexity of the subject, noise in the measured data, or unsuitable observables.

To make this more concrete we could say that the inner goals of modeling consist in a two-fold (and thus synchronous!) segmentation of the data, resulting in the selection of the proper variables and the selection of the proper records, where this segmentation is performed under the condition of a preceding non-linear transformation of the embedding reference system. Ideally, the model identifies the data for which it is applicable; only for those data a classification is then provided. It is pretty clear that this statement is an ambitious one. Yet, we regard it as crucial for any attempt to step across our epistemic bridge that brings us from particular data to the quality of experience. This transition includes something that is probably better known by the label "induction". Thus, we finally arrive at a short statement about the inner goals of modeling:

How to conclude and what to conclude from measured data?

Obviously, if our data are noisy and if they include irrelevant values, any further conclusion will be unreliable. Yet, for any suitable segmentation of the data we need a model first. From this it directly follows that a suitable procedure for modeling can't consist of just a single algorithm, or a "one-shot procedure". All single-step approaches suffer from lots of hidden assumptions that influence the results and their properties in unforeseeable ways. Modeling that could be regarded as more than just an estimation of parameters by running an algorithm is necessarily a circular and, dependent on the number of variables, possibly open-ended process.

Predictability and Predictivity

Let us assume a set of observations S obtained from an empirical process P. This process P should be called "predictable" if there is a mapping function f(m), serving as an instance of a hypothesis h from the space of hypotheses H, whose results coincide with the outcomes of the process P in such a way that f(m) forms an expectation with a deviation d < ε for all f(m). In this case we may say that f(m) predicts P. This deviation is also called the "empirical risk", and the purpose of modeling is often regarded as the minimization of the empirical risk (ERM).
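The definition can be made concrete in a few lines. The following Python sketch is our own illustration, not part of any particular library; the deviation is taken as an absolute difference for simplicity, and all names are hypothetical:

```python
import numpy as np

def empirical_risk(f, observations, outcomes):
    """Mean deviation between the expectations f(m) and the actual
    outcomes of the process P; the quantity minimized under ERM."""
    deviations = np.abs(np.array([f(m) for m in observations]) - np.asarray(outcomes))
    return float(deviations.mean())

def predicts(f, observations, outcomes, eps=0.05):
    """f predicts P if the deviation d stays below eps for all f(m)."""
    deviations = np.abs(np.array([f(m) for m in observations]) - np.asarray(outcomes))
    return bool(np.all(deviations < eps))
```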

There are then two important questions. Firstly, can we trust f(m), since f(m) has been built on a limited number of observations? Secondly, how can we make f(m) more trustworthy, given the limitation regarding the data? Usually, these questions are handled under the label of validation. Yet, validation procedures are not the only possible means to get an answer here. It would be a misunderstanding to think that it is the building or construction of a model that is problematic.

The first question can be answered only by considering different models. For obtaining a set of different models we could apply different methods. That would be o.k. if prediction were our sole interest. Yet, we also strive for structural insights, and from that perspective we should not, of course, use different methods to get different models. The second possibility for addressing the first question is to use different sub-samples, which turns simple validation into cross-validation. Cross-validation provides an expectation for the error (or the risk). Yet, in order to compare across methods one actually should describe the expected decrease in "predictive power"13 for different sample sizes (independent cross-validation per sample size). The third possibility for answering question (1) is related to the former and consists in adding noisy, surrogated (or simulated) data. This prevents the learning mechanism from responding to empirically consistent, but nevertheless irrelevant fluctuations in the raw data set. The fourth possibility is to look for models of equivalent predictive power which are, however, based on a different set of predicting variables. This possibility is not accessible for most statistical approaches such as Principal Component Analysis (PCA). Whatever method is used to create different models, the models may be combined into a "bag" of models (called "bagging"), or, following an even more radical approach, into an ensemble of small and simple models. This is employed for instance in the so-called Random Forest method.
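The second and third of these possibilities are easy to sketch. The following Python fragment is a minimal illustration under our own assumptions; fit and predict are placeholders for whatever modeling method is being used:

```python
import numpy as np

def cross_validation_error(fit, predict, X, y, k=10, seed=0):
    """k-fold cross-validation: mean and spread of the error across sub-samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = fit(X[train], y[train])
        errors.append(np.mean(predict(model, X[fold]) != y[fold]))
    return np.mean(errors), np.std(errors)

def noisy_validation_error(fit, predict, X, y, noise=0.1, seed=0):
    """Repeat the validation on data blurred by Gaussian noise, so the
    learning mechanism can't profit from irrelevant fluctuations."""
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(0.0, noise * X.std(axis=0), size=X.shape)
    return cross_validation_error(fit, predict, X_noisy, y, seed=seed)
```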

Commonly, if a model passes cross-validation successfully, it is considered to be able to "generalize". In contrast to the common practice, Poggio et al. [7] demonstrated that standard cross-validation has to be extended in order to characterize the capability of a model to generalize. They propose to augment

CV_loo stability with stability of the expected error and stability of the empirical error to define a new notion of stability, CVEEE_loo stability.

This makes clear that the approach of Poggio et al. addresses the learning machinery, no longer just the space of hypotheses. Yet, they do not take the free parameters of the method into account. We conclude that their proposed approach still remains uncritical; thus I would consider such a model as not completely trustworthy. Of course, Poggio et al. are definitely pointing in the right direction. We recognize a move away from naive realism and positivism, towards a critical methodology of the conditional. Maybe philosophy and the natural sciences will find common ground again by riding the information tiger.

Checking the stability of the learning procedure leads to a methodology that we called "data experiments" elsewhere. Data experiments do NOT explore the space of hypotheses, at least not directly. Instead they create a map of all possible models. In other words, instead of just asking about predictability we now ask about the differential predictivity in the space of models.

From the perspective of a learning theory, Poggio's move can't be overestimated. Statistical learning theory (SLT) [8] explicitly assumes that a direct access to the world is possible (via an identity function, i.e. perfectness of the model). Consequently, SLT focuses (only) on the reduction of the empirical risk. Any learning mechanism following the SLT is hence uncritical about its own limitation. SLT is interested in the predictability of the system-as-such, thereby, not surprisingly, committing the mistake of pre-19th-century idealism.

The Independence Assumption

The independence assumption [I.A.], or linearity assumption, acts mainly on three different targets. The first of them is the relationship between observer and observed, while its second target is the relationship between observables. The third target finally regards the relation between individual observations. This last aspect of the I.A. is the least problematic one. We will not discuss this any further.

Yet, the first and the second targets are the problematic ones. The I.A. is deeply buried in the framework of statistics, and from there it made its way into the field of exploratory data analysis. There it can frequently be met, for instance, in the geometrical operationalization of similarity, in the conceptualization of observables as Cartesian dimensions or as independent coefficients in systems of linear equations, or as statistical kernels in algorithms like the Support Vector Machine.

Of course, the I.A. is just one possible stance towards the treatment of observables. Yet, taking it as an assumption we will not include any parameter into the model that reflects the dependency between observables. Hence, we will never detect the most suitable hypothesis about that dependency. Instead of assuming the independence of variables throughout an analysis, it would be methodologically much more sound to address the degree of dependency as a target. Linearity should not be an assumption, it should be a result of an analysis.

The linearity or independence assumption carries another assumption under its hood: the assumption of the homogeneity of variables. Variables, or assignates, are conceived as black boxes, with unknown influence on the predictive power of the model. Yet, usually they exert very different effects on the predictive power of a model.

Basically, it is very simple. The predictive power of a model depends on the positive predictive value AND the negative predictive value, of course; we may also use the closely related terms sensitivity and specificity. Accordingly, some variables contribute more to the positive predictive value, others help to increase the negative predictive value. This easily becomes visible if we perform a detailed type-I/II error analysis. Thus, there is NO way to avoid testing those combinations explicitly, even if we assume the initial independence of variables.
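A minimal sketch of such explicit testing (our own illustration; confusion_for is a hypothetical callable that builds and validates a model on a given subset of variables and returns its confusion matrix):

```python
from itertools import combinations

def ppv_npv(tp, fp, fn, tn):
    """Positive and negative predictive value of a confusion matrix."""
    return tp / max(tp + fp, 1), tn / max(tn + fn, 1)

def scan_variable_subsets(variables, confusion_for):
    """Each subset of variables yields its own confusion matrix, hence
    its own ppv/npv profile; some subsets favor ppv, others npv."""
    profile = {}
    for r in range(1, len(variables) + 1):
        for subset in combinations(variables, r):
            tp, fp, fn, tn = confusion_for(subset)
            profile[subset] = ppv_npv(tp, fp, fn, tn)
    return profile
```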

As we already mentioned above, the I.A. is just one possible stance towards the treatment of observables. Yet, its status as a methodological sine qua non, which additionally is never reflected upon, renders it a metaphysical assumption. It is in fact an irrational assumption, which induces serious costs in terms of the structural richness of the results. Taken together, the independence assumption represents one of the most harmful habits in data analysis.

The Model Selection Problem

In the section "Predictability and Predictivity" above we already emphasized the importance of the switch from the space of hypotheses to the space of models. The model space unfolds as a condition of the available assignates, the size of the data set, and the free parameters of the associative ("modeling") method. The model space supports a fundamental change of attitude towards a model. Based on the denial of the apriori assumption of independence of observables, we identified the idea of a singular best model as an ill-posed phantasm. We thus move onwards from the concept of a model as a mapping function towards ensembles of structurally heterogeneous models that together, as a distinguished population, form a habitat, a manifold in the sphere of the model space. With such a structure we no longer need to arrive at a single model.

Methods, Models, Variables

The model selection problem addresses two sets of parameters that are actually quite different from each other. Model selection should not be reduced to the treatment of the first set, of course, as happens at least implicitly for instance in [9]. The first set refers to the variables as known from the data, sometimes also called the "predictors". The selection of the suitable variables is the first half of the model selection problem. The second set comprises all free parameters of the method. From the methodological point of view, this second set is much more interesting than the first one. The method's parameters are apriori conditions for the performance of the method, and in contrast to the selection of variables they usually remain invisible in the results.

For associative methods like the SOM or other clustering methods, the effect of de-/selecting variables can easily be described. Just take all the objects in front of you, for instance on the table, or in your room. Now select an arbitrary purpose and assign this purpose as a degree of support to those objects. Thereby we have constructed the target. Now we go "into" the objects, that is, we describe them by a range of attributes that are present in most of the objects. Dependent on the selection of a subset from these attributes we will arrive at very different groups. The groups represent the target more or less well; that is the quality of the model. Obviously, this quality differs across the various selections of attributes. It is also clear that it does not help to just use all attributes, because some of the attributes destroy the intended order: they add noise to the model and decrease its quality.

As George observes [11], since its first formulation in the 1960ies a considerable, if not large, number of approaches to the variable selection problem has been proposed. Although George himself seems to distinguish the two sets of parameters, throughout the discussion of the different approaches he always refers just to the first set, the variables as included in the data. This is not a failure of the said author, but a problem of the statistical approach. Usually, the parameters of statistical procedures are not accessible; like any analytic procedure, they work as they work. In contrast to Self-organizing Maps, and even to Artificial Neural Networks (ANN) or genetic procedures, analytic procedures can't be modified in order to achieve a critical usage. In some way, with their mono-bloc design they perfectly fit the representationalist fallacy.

Thus, using statistical (or other analytic) procedures, the model selection problem consists of the variable selection problem and the method selection problem. The consequences are catastrophic: if statistical methods are used in the context of modeling, the whole statistical framework turns into a black box, because the selection of a particular method can't be justified in any respect. In contrast to that quite unfavorable situation, methods like the Self-Organizing Map provide access to any of their parameters. Data experiments are only possible with methods like the SOM or ANN. It is not the SOM or the ANN that are "black boxes"; rather, the statistical framework must be regarded as such. Precisely this is also the reason for the still ongoing quarrels about the foundations of the statistical framework. There are two parties, the frequentists and the Bayesians; yet, both are struck by the reference class problem [12]. From our perspective, the current dogma of empirical work in science needs to be changed.

The conclusion is that statistical methods should not be used at all to describe real-world data, i.e. for the modeling of real-world processes. They are suitable only within a fully controlled setting, that is, within a data experiment. The first step in any kind of empirical analysis thus must consist of a predictive modeling that includes the model selection task.14

The Perils of Universalism

Many people dealing with the model selection task are misled by a further irrational phantasm, caused by a mixture of idealism and positivism: the phantasm of the single best model for a given purpose.

Philosophers of science long ago recognized, starting with Hume and ultimately expressed by Quine, that empirical observations are underdetermined. The actual challenge posed by modeling is given by this fact of empirical underdetermination. Goodman felt obliged to construct a paradox from it. Yet, there is no paradox; there is only the phantasm of the single best model. This phantasm is a relic from the Newtonian period of science, when everybody thought the world had been made by God as a miraculous machine, everything had to be well-defined, and persisting contradictions had to be rated as evil.

Secondarily, this moults into the affair of (semantic) indetermination. Plainly spoken, there are never enough data. Empirical underdetermination results in the actuality of strongly diverging models, which in turn gives rise to conflicting experiences. For a given set of data, in most cases it is possible to build very different models (ceteris paribus, by choosing different sets of variables) that yield the same utility, or say predictive power, as far as this predictive power can be determined by the available data sample at all. Such ceteris-paribus differences will not only give rise to quite different tracks of unfolding interpretation; they also bring us into the close vicinity of Derrida's deconstruction.

Empirical underdetermination thus results in a second-order risk, the model selection risk. Actually, the model selection risk is the only relevant risk. We can't change the available data, and data are always limited, sometimes just by their scarcity, sometimes by the restrictions on dealing with them. Risk is not attached to objects or phenomena, because objects "are not there" before interpretation and modeling. Risk is attached only to models. Risk is a particular state of affairs, and indeed a rather fundamental one. Once a particular model told us that there is an uncertainty regarding the outcome, we could take measures to deal with that uncertainty. For instance, we hedge it, or organize some other kind of insurance for it. But hedging has to rely on the estimation of the uncertainty, which depends on the expected predictive power of the model, not just on the accuracy of the model given the available data from a limited sample.

Different, but equivalent selections of variables can be used to create a group of models acting as "experts" on a given task. Yet, the selection of such "experts" is not determinable on the basis of the given data alone. Instead, further knowledge about the relation of the variables to further contexts or targets needs to be consulted.

Universalism is usually unjustifiable, and claiming it nevertheless usually comes at huge costs, caused by undetectable blindnesses once we accept it. In contemporary empiricism, universalism, and the respective blindness, is abundant also with regard to the role of the variables. What I am talking about here is context, mediality and individuality, which, from a more traditional formal perspective, is often approximated by conditionality. Yet, it becomes more and more clear that the Bayesian mechanisms are not sufficient to cover the complexity of the concept of variables. Just to mention the current developments in the field of probability theory, I would like to refer to Brian Weatherson, who favors and develops the so-called dynamic Keynesian models of uncertainty [10]. Yet, we regard this only as a transitional theory, despite the fact that it will have a strong impact on the way scientists will handle empiric data.

The mediating individuality of observables (as deliberately chosen assignates, of course) is easy to observe once we drop the universalism qua independence of variables. Concerning variables, universalism manifests in an indistinguishability of the choices made to establish the assignates with regard to their effect on the system of preferences. A criterion C will induce the putative objects as distinguished ones only if another assignate A has pre-sorted them. Yet, it would be a simplification to consider the situation in the Bayesian way as P(C|A). The problem with that is that we can't say anything about the condition itself. Yet, we need to "play" (actually not "control") with the conditionability, the inner structure of these conditions. As with the "relation", which we already generalized into randolations, thereby making it measurable, we also have to go into the condition itself in order to defeat idealism even on the structural level. An appropriate perspective would hence treat variables as a kind of medium. This mediality is not externalizable, though, since observables themselves precipitate from the mediality, then as assignates.

What we experience here is nothing else than the first advent of a truly post-modernist world, an era in which we emancipate ourselves from the compulsive apriori of independence (which does not deny, of course, its important role in the modernist era since Descartes).

Optimization

Optimizing a model means to select a combination of suitably valued parameters such that the preferences of the users in terms of risk and implied costs are served best. The model selection problem is thus the link between optimization problems, learning tasks and predictive modeling. There are indeed countless procedures for optimization. Yet, the optimization task in the context of model selection is faced with a particular challenge: its mere size. George begins his article in the following way:

A distinguishing feature of variable selection problems is their enormous size. Even with moderate values of p, computing characteristics for all 2^p models is prohibitively expensive and some reduction of the model space is needed.

Assume for instance a data set that comprises 50 variables. From these, 2^50 ≈ 1.13e15 models are possible. Assume further that we could test 10'000 models per second; then we still would need more than 3'500 years to check all models. Usually, however, building a classifier on a real-world problem takes more than 10 seconds, which would result in roughly 3.6e8 years in the case of 50 variables. And there are many instances where one is faced with far more variables, typically 100+, sometimes going even into the thousands. That's what George means by "prohibitively".
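The arithmetic is easily checked:

```python
def exhaustive_search_years(p, seconds_per_model):
    """Wall-clock time for evaluating all 2**p subsets of p variables."""
    return 2 ** p * seconds_per_model / (3600 * 24 * 365)

print(exhaustive_search_years(50, 1e-4))  # ~3.6e3 years at 10'000 models/s
print(exhaustive_search_years(50, 10.0))  # ~3.6e8 years at 10 s per model
```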

There are many proposals to deal with that challenge. All of them fall into three classes: (1) they use some information-theoretic measure (AIC, BIC, CIC etc. [11]), or (2) they use likelihood estimators, i.e. they conceive of the parameters themselves as random variables, or (3) they are based on probabilistic measures established upon validation procedures. Particularly the instances from the first two of those classes are hit by the linearity and/or the independence assumption, and also by unjustified universalism. Of course, linearity should not be an assumption, it should be a result, as we argued above. Hence, there is no way to avoid the explicit calculation of models.

Given the vast number of combinations of symbols, it appears straightforward to conceive of the model selection problem from an evolutionary perspective. Evolution always creates appropriate and suitable solutions from the available "evolutionary model space". That space is of size 2^30'000 in the case of humans, a "much" larger number than the number of species that have ever existed on this planet. Not a single viable configuration could have been found by pure chance. Genetics-based alignment and navigation through the model space is much more effective than chance. Hence, the so-called genetic algorithms might appear on the radar as the method of choice.

Genetics, revisited

Unfortunately, genetic algorithms15 are not suitable for the variable selection problem. The main reason is the expensive calculation of single models: in order to set up the genetic procedure, one needs at least 500 instances to form the initial population, while any solution for the variable selection problem should arrive at a useful result with fewer than 200 explicitly calculated models. The great advantage of genetic algorithms is their capability to deal with solution spaces that contain local extrema. They can handle even solution spaces that are inhomogeneously rugged, simply for the reason that recombination in the realm of the symbolic does not care about numerical gradients and criteria. Genetic procedures are based on combinations of symbolic encodings. The continuous switch between the symbolic (encoding) and the numerical (effect) is nothing else than the precursor of the separation between genotypes and phenotypes, without which there would not even be simple forms of biological life.

For that reason we developed a specialized instantiation of the evolutionary approach (implemented in SomFluid). Described very briefly, we use evolutionary weights as efficient estimators of the maximum likelihood of parameters. The estimates are derived from explicitly calculated models that vary (mostly, but not necessarily ceteris paribus) with respect to the used variables. As such estimates, they influence the further course of the exploration of the model space in a probabilistic manner. From the perspective of the evolutionary process, these estimates represent the contribution of the respective parameter to the overall fitness of the model. They also form a kind of long-term memory within the process, something like a probabilistic genome. The short-term memory in this evolutionary process is represented by the intensional profiles of the nodes in the SOM.

For the first initializing step, the evolutionary estimates can themselves be estimated by linear procedures like PCA, or by non-parametric procedures (Kruskal-Wallis, Mann-Whitney, etc.), and are available after only a few explicitly calculated models (model here means "ceteris paribus selection of variables").

These evolutionary weights reflect the changes of the predictive power of the model when variables are added to or removed from the model. If the quality of the model improves, the evolutionary weight increases a bit, and vice versa. In other words, not the apriori parameters of the model are considered, but just the effect of the parameters. The procedure is an approximation by repetition: fix the parameters of the model (method-specific settings, sampling, variables), calculate the model, and record the change of the predictive power as compared to the previous model.
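The following sketch shows the gist of this procedure. It is our own simplified illustration, not the SomFluid implementation; the update rule, the learning rate and all names are assumptions:

```python
import numpy as np

def update_weights(weights, subset, delta_power, lr=0.1):
    """Nudge the evolutionary weight of each variable in the tested
    subset according to the observed change in predictive power."""
    for v in subset:
        weights[v] = float(np.clip(weights[v] + lr * delta_power, 0.0, 1.0))
    return weights

def next_subset(weights, size, seed=0):
    """Sample the next ceteris-paribus selection of variables, biased
    by the current weights (the 'probabilistic genome')."""
    rng = np.random.default_rng(seed)
    variables = list(weights)
    p = np.array([weights[v] for v in variables], dtype=float)
    return list(rng.choice(variables, size=size, replace=False, p=p / p.sum()))
```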

Upon the probabilistic genome of evolutionary weights there are many different ways one could take to implement the "evo-devo" mechanisms, be it the issue of how to handle the population (e.g. mixing genomes, aspects of virtual ecology, etc.), or the translational mechanisms, so to speak the "physiologies" that are used to proceed from the genome to an actual phenotype.

Since many different combinations are being calculated, the evolutionary weight represents the expectable contribution of a variable to the predictive power of the model, under whatever selection of variables represents a model. Usually, a variable will not improve the quality of the model irrespective of the context. Yet, if a variable indeed did so, we not only would say that its evolutionary weight equals 1, we also may conclude that this variable is a so-called confounder. Including a confounder into a model means that we use information about the target which will not be available when applying the model for the classification of new data; hence the model will fail disastrously. Usually it is not possible for a procedure to identify confounders by itself; the capability to do so is just a further benefit of dropping the independence-universalism assumption. It is also clear that this capability is one of the cornerstones of autonomous learning, which includes the capability to set up the learning task.

Noise, and Noise

Optimization raises its own follow-up problems, of course. The most salient of these is so-called overfitting. This means that the model gets suitably fitted to the available observations by including a large number of parameters and variables, but it will return wrong predictions when used on data that differ even slightly from the observations used for learning and for estimating the parameters of the model. Such a model represents noise: random variations without predictive value.

As we described above, Poggio believes that his criterion of stability overcomes the defects with regard to the model as a generalization from observations. Poggio might be too optimistic, though, since his method still remains confined to the available observations.

In this situation, we apply a methodological trick. The trick consists in turning the problem into a target of investigation, which ultimately translates the problem into an appropriate rule. In this sense, we consider noise not as a problem, but as a tool.

Technically, we destroy the relevance of the differences between the observations by adding noise of a particular characteristic. If we add a small amount of normally distributed noise, probably nothing will change, but if we add a lot of noise, perhaps even of a secondarily changing distribution, it will become simply impossible to create a stable model at all. The scientific approach is to describe the dependency between those two unknowns, so to speak, to set up a differential between the noise (a model for the unknown) and the model (of the unknown). The rest is straightforward: create various data sets that have been changed by imposing different amounts of noise of a known structure, and plot the predictive power against the amount of noise. This technique can be combined with surrogating the actual observations via a Cholesky decomposition.
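A compact sketch of that differential, again under our own assumptions (build_and_score stands for any procedure that builds a model and returns its predictive power):

```python
import numpy as np

def noise_profile(X, y, build_and_score, levels, seed=0):
    """Predictive power as a function of the amount of injected noise:
    the differential between the noise and the model of the unknown."""
    rng = np.random.default_rng(seed)
    profile = []
    for s in levels:
        X_noisy = X + rng.normal(0.0, s * X.std(axis=0), size=X.shape)
        profile.append((s, build_and_score(X_noisy, y)))
    return profile  # plot power against noise level
```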

From all available models those are then preferred that combine a suitable predictive power with a suitable degree of stability against noise.

Résumé

In this section we have dealt with the problematics of selecting a suitable subset from all available observables (neglecting for the time being that model selection involves the method's parameters, too). Since we mostly have more observables at our disposal than we actually presume to need, the task could simply be described as simplification, aka Occam's Razor. Yet, it would be terribly naive to first assume linearity and then to select the "most parsimonious" model. It is even cruel to state [9, p.1]:

It is said that Einstein once said

Make things as simple as possible, but not simpler.

I hope that I succeeded in providing some valuable hints for accomplishing that task, which above all is not a quite simple one. (etc.etc. :)

Describing Classifiers

The gold standard for describing classifiers is believed to be the Receiver Operating Characteristic, or ROC for short. Particularly, the area under the curve is compared across models (classifiers). Figure 1 demonstrates the mechanics of the ROC plot.

Figure 1: Basic characteristics of the ROC curve (reproduced from Wikipedia)

Figure 2: Realistic ROC curves, though these are typical for approaches that are NOT based on sub-group structures or ensembles (for instance ANN or logistic regression). Note that models should not be selected on the basis of the area under the curve. Instead, the true positive rate (sensitivity) at a false positive rate FPR=0 should be used for that. As a further criterion, indicating the stability of the model, one could use the slope of the curve at FPR=0.
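The criteria named in the caption of Figure 2 are easily operationalized. A minimal sketch, assuming the ROC is given as two arrays of FPR and TPR values:

```python
import numpy as np

def roc_selection_criteria(fpr, tpr):
    """True positive rate at FPR = 0, plus the initial slope of the
    curve as an indicator of the model's stability."""
    fpr, tpr = np.asarray(fpr, float), np.asarray(tpr, float)
    tpr_at_zero = tpr[fpr == 0].max() if np.any(fpr == 0) else 0.0
    nonzero = np.nonzero(fpr > 0)[0]
    slope = tpr[nonzero[0]] / fpr[nonzero[0]] if len(nonzero) else float("inf")
    return tpr_at_zero, slope
```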

Utilization of Information

There is still another harmful aspect of the universalistic stance in data analysis as compared to a pragmatic stance. This aspect concerns the "reach" of the models we are going to build.

Let us assume that we would accept a sensitivity of approx. 80%, but we also expect a specificity of >99%. In other words, the costs for false positives (FP) are defined as very high, while the costs for false negatives (FN, not recognized preferred outcomes) are relatively low. The ratio of error costs, or in short the error cost ratio (ECR) err(FP)/err(FN), is high.

Table 1a: Confusion matrix for a quite performant classifier.

Symbols: test = model; TP = true positives; FP = false positives; FN = false negatives; TN = true negatives; ppv = positive predictive value; npv = negative predictive value. FP is also called type-I error (analogous to "rejecting the null hypothesis when it is true"), while FN is called type-II error (analogous to "accepting the null hypothesis when it is false"); FN/(TP+FN) is called the type-II error rate, sometimes labeled as β, where (1-β) is called the "power" of the test or model. (download XLS example)

             condition Pos    condition Neg
test Pos     100 (TP)         3 (FP)          0.971  (ppv)
test Neg     28 (FN)          1120 (TN)       0.976  (npv)
             0.781            0.997
             sensitivity      specificity
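The four measures in Table 1a follow directly from the four cells of the matrix; a minimal sketch:

```python
def confusion_measures(tp, fp, fn, tn):
    return {
        "ppv":         tp / (tp + fp),
        "npv":         tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

print(confusion_measures(100, 3, 28, 1120))
# ppv 0.971, npv 0.976, sensitivity 0.781, specificity 0.997
```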

Let us further assume that there are observations of our preferred outcome that we can't distinguish well from cases of the opposite outcome that we try to avoid. They are too similar, and due to that similarity they form a separate group in our self-organizing map. Let us assume that the specificity of such a cluster is only 86%, while its sensitivity is 94%.

Table 1b: Confusion matrix describing a sub-group formed inside the SOM, for instance as it could be derived from the extension of a "node". Figures outside the parentheses give the values in case the sub-group is denied as a positive predictor, figures in parentheses the values in case it is accepted (see below).

             condition Pos    condition Neg
test Pos     0 (50)           0 (39)          0.0 (0.56)  (ppv)
test Neg     50 (0)           39 (0)          0.44 (1.0)  (npv)
             0.0 (1.0)        1.0 (0.0)
             sensitivity      specificity

Yet, this cluster would not satisfy our risk attitude. If we used the SOM as a model for the classification of new observations, and a new observation fell into that group (by means of similarity considerations), the implied risk would violate our attitude. Hence, we have to exclude such clusters. In the ROC, this cluster represents a value further to the right on the X-axis, i.e. towards a higher false positive rate and lower specificity.

Note that in the case of acceptance of the sub-group as a contributor to a positive prediction, the false negatives are always 0 a posteriori, while in the case of denial the true positives are always set to 0 (and accordingly for the figures of the negative condition).

There are now several important points to this, all related to each other. Actually, we should be interested only in sub-groups with a specificity close to 1, such that our risk attitude is well served [13]. Likewise, we should not try to optimize the quality of the model across the whole range of the ROC, but only for the sub-groups with an acceptable error cost ratio. In other words, we use the available information in a very specific manner.

As a consequence, we have to set the ECR before calculating the model. Setting the ECR after the selection of a model results in a waste of information, time and money. For this reason it is strongly indicated to use methods that build their representation from sub-groups. This again rules out statistical methods, as they always take into account all available data; Zytkow calls such methods empirically empty [14].
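The acceptance rule for a sub-group can be stated in a few lines (a sketch under our own assumptions; the costs are hypothetical, fixed before any model is calculated):

```python
COST_FP, COST_FN = 50.0, 1.0  # hypothetical costs, ECR = 50

def accept_subgroup(tp, fp):
    """Accept a sub-group as a positive predictor only if the cost of
    its false positives stays below the cost of the false negatives
    produced by rejecting it (rejection turns its TP into FN)."""
    return fp * COST_FP < tp * COST_FN
```

For the sub-group of Table 1b (tp=50, fp=39) this yields rejection, matching the exclusion argued above.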

The possibility to build models of a high specificity is a huge benefit of sub-group based methods like the SOM.16 To understand this better, let us assume we have a SOM-based model with the following overall confusion matrix:

             condition Pos    condition Neg
test Pos     78               1               0.9873  (ppv)
test Neg     145              498             0.7745  (npv)
             0.350            0.998
             sensitivity      specificity

That is, the model recognizes around 35% of all preferred outcomes. It does so on the basis of sub-groups that all satisfy the respective ECR criterion. Thus we know that the implied risk of any classification is very low, too. In other words, such models recognize whether it is allowed to apply them. If we apply them and get a positive answer, we also know that it is justified to apply them. Once the model identifies a preferred outcome, it does so without risk. This lets us miss opportunities, but we won't be trapped by false expectations. Such models we could call auto-consistent.

In a practical project aiming at an improvement of the post-surgery risk classification of patients (n>12'000) in a hospital, we have been able to demonstrate that the achievable validated rate of implied risk can be lower than 10^-4 [15]. Such a low rate is not achievable by statistical methods, simply because there are far too few incidents of wrong classifications. The subjective cut-off points in logistic regression are not quite suitable for such tasks.

At the same time, and that's probably even more important, we get a suitable segmentation of the observations. All observations that can be identified as positive do not suffer from any risk. Thus, we can investigate the structure of the data for these observations, e.g. as particular relationships between variables, such as correlations etc. But, hey, that job is already done by the selection of the appropriate set of variables! In other words, we not only have a good model, we also have found the best possibility for a multi-variate reduction of noise, with full consideration of the dependencies between variables. Such models can be conceived as a reversed factorial experimental design.

The property of auto-consistency offers a further benefit, as it is scalable; that is, "auto-consistent" is not a categorical, or symbolic, assignment. It can easily be measured as sensitivity under the condition of specificity > 1-ε, ε→0. Thus, we may use it as a random measure (it can be described by its density) or as a scale of reference in any selection task among sub-populations of models. Additionally, if the exploration of the model space does not succeed in finding a model of a suitable degree of auto-consistency, we may conclude that the quality of the data is not sufficient. Data quality is a function of properly selected variables (predictors) and reproducible measurement. We know of no other approach that would be able to inform about the quality of the data without referring to extensive contextual "knowledge". Needless to say, such knowledge is never available, nor encodable.
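Since the measure is scalable via ε, it can be sketched directly (our own illustration; each sub-group is assumed to carry its own confusion counts):

```python
def auto_consistency(subgroups, eps=0.01):
    """Sensitivity restricted to sub-groups whose specificity exceeds
    1 - eps; returns 0.0 if no sub-group qualifies."""
    qualified_tp = sum(
        g["tp"] for g in subgroups
        if g["tn"] / max(g["tn"] + g["fp"], 1) > 1.0 - eps
    )
    positives = sum(g["tp"] + g["fn"] for g in subgroups)
    return qualified_tp / positives if positives else 0.0
```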

There are only weak conditions that need to be satisfied. For instance, the same selection of variables needs to be used within a single model for all similarity considerations. This rules out all ensemble methods in which different selections of variables are used for each item in the ensemble, for instance decision tree methods (a SOM with its sub-groups is already "ensemble-like", yet all sub-groups are affected by the same selection of variables). It is further required to use a method that performs the transition from extensions to intensions on the sub-group level, which rules out analytic methods, and even Artificial Neural Networks (ANN): there is no way to establish auto-consistent models with an ANN. Further, the error cost ratio must be set before calculating the model, and the models have to be calculated explicitly, which removes linear methods from the list, such as Support Vector Machines with linear kernels (regression, ANN, Bayes). If we want to access the rich harvest of auto-consistent models, we have to drop the independence hypothesis and we have to refute any kind of universalism. But these costs are rather low, indeed.

Observations and Probabilities

Here we developed a particular perspective on the transition from observations to intensional representations. There are of course some interesting relationships between our point of view and the various possibilities of "interpreting" probability (see [16] for a comprehensive list of "interpretations" and interesting references). We also provide a new answer to Hume's problem of induction.

Hume posed the question of how often we should observe a fact until we may consider it as lawful. This question, called the "problem of induction", points in the wrong direction and will trigger only irrelevant answers. Hume, still living in times of absolute monarchism, in a society deeply structured by religious beliefs, established a short-cut between the frequency of an observation and its propositional representation. The actual question, however, is how to achieve what we call an "observation" in the first place.

In very simple, almost artificial cases like the die there is nothing to interpret. The die and its values are already symbols. It is in some way inadequate to conceive of a die or of dicing as an empirical issue. In fact, we know before what could happen. The universe of the die consists of precisely 6 singular points.

Another extreme are so-called single-case observations of structurally rich events, or processes. An event, or a setting, should be called structurally rich if there are (1) many different outcomes and (2) many possible assignates to describe the event or the process. Such events or processes will not produce any outcome that could be expected by symbolic or formal considerations. Obviously, it is not possible to assign a relative frequency to a unique, singular, or non-repeatable event. Unfortunately, however, as Hájek points out [12], any actual sequence can be conceived of as a singular event.

The important point now is that single-case observations are also not sufficiently describable as an empirical issue. Ascribing propensities to objects-in-the-world demands a wealth of modeling activities and classifications, which have to be completed prior to the observation under scrutiny. So-called single-case propensities are not a problem of probabilistic theory, but one of the application of intensional classes and their usage as means for organizing one's own expectations. As we said earlier, probability as it is used in probability theory is not a concept that could be applied meaningfully to observations, where observations are conceived of as primitive "givens". Probabilities are meaningful only in the closed world of available, subjectively held concepts.

We thus have to distinguish between two areas of application for the concept of probability: the observational part, where we build up classes, and the anticipatory part, where we are interested in a match of expectations and actual outcomes. The problem obviously arises by mixing them through the notion of causality.17 Yet, there is no necessary link between the two areas. The concept of risk probably allows for a resolution of the problems, since risk always implies a preceding choice of a cost function, which necessarily is subjective. Yet, the cost function, and the risk implied by a classification model, is also the pivot for any kind of negotiation, whether this takes place on a material, hence evolutionary, scale, or within a societal context.

The interesting, if not salient point is that the subjectively available intensional descriptions and classes are dependent on one's risk attitude. We may observe the same thing only if we have acquired the same system of related classes and the same habits of using them. Only if we apply extreme risk aversion will we achieve a common understanding about facts (in the Wittgensteinian sense, see above). This, then, is called science, for instance. Yet, it still remains a misunderstanding to equate this common understanding with objects as objects-out-there.

The problem of induction thus must be considered as a seriously ill-posed problem. It is a problem only for idealists (who then solve it in a weird way), or for realists who are naive about the epistemological conditions of acting in the world. Our proposal for the transition from observations to descriptions is based on probabilism on both sides, yet on either side there is a distinct flavor of probabilism.

Finally, a methodological remark shall be allowed, closely related to what we already described in the section about “noise” above. The perspective onto “making experience” that we have been proposing here demonstrates a significant twist.

Above we already mentioned Alan Hájek's diagnosis that the frequentist and the Bayesian interpretations of probability suffer from the reference class problem. In this section we extended Hájek's concerns to the concept of propensity. Yet, if the problem shows such a high prevalence, we should not conceive it as a hurdle but should try to treat it dynamically as a rule. The reference class is only a problem as long as (1) either the actual class is required as an external constant, or (2) the abstract concept of the class is treated as a fixed point. According to the rule of Lagrange-Deleuze, any constant can be rewritten into a procedure (read: rules) and less problematic constants. Constants, or fixed points, on a higher abstraction level are less problematic, because the empirically grounded semantics vanishes.

Indeed, the problem of the reference class simply disappears if we put the concept of the class, together with all the related issues of modeling, as the embedding frame, the condition under which any notion of probability can make sense at all. The classes themselves are results of "rule-following", which admittedly is blind, but whose parameters are transparently accessible. In this way, probabilistic interpretation is always performed in a universe that is closed and in principle fully mapped. We need probabilistic methods just because that universe is of a huge size. In other words, the space of models is a Laplacean Universe.

Since statistical methods and similar interpretations of probability are analytical techniques, our proposal for a re-positioning of statistics into such a Laplacean Universe is also well aligned with the general habit of Wittgenstein’s philosophy, which puts practiced logic (quasi-logic) second to performance.

The disappearance of the reference class problem should be expected if our relations to the world are always mediated through the activity of abstract, epistemic modeling. The usage of probability theory as a "conceptual game" aiming at sharing diverging attitudes towards risks then appears as nothing else than a particular style of modeling, though admittedly one that offers a reasonable rate of success.

The Result of Modeling

It should be clear by now that the result of modeling is much more than just a single predictive model. Regardless whether we take the scientific perspective or a philosophical vantage point, we need to include operationalizations of the conditions of the model that reach beyond the standard empirical risk expressed as "false classification". Appropriate modeling provides not only a set of models of different structures and with well-estimated stability; a further goal is to establish models that are auto-consistent.

If the modeling employs a method that exposes its parameters, we can even avoid the "method hell"; that is, the results are not only reliable, they are also valid.

It is clear that only auto-consistent models are useful for drawing conclusions and for building up experience. If variables are just weighted without actually being removed, as for instance in approaches like Support Vector Machines, the resulting models are not auto-consistent. Hence, there is no way towards a propositional description of the observed process.

Given the population of explicitly tested models, it is also possible to describe the differential contribution of any variable to the predictive power of a model. The assumption of neutrality or symmetry of that contribution, as it is for instance applied in statistical learning, is a simplistic perspective on the variables and the system represented by them.

Conclusion

In this essay we described some technical aspects of the capability to experience. These technical aspects link the possibility for experience to the primacy of interpretation that gets actualized as the techné of anticipatory, i.e. predictive or diagnostic, modeling. This techné does not address the creation or derivation of a particular model by means of employing one or several methods; the process of building a model could be fully automated anyway. Quite differently, it focuses on the parametrization, validation, evaluation and application of models, particularly with respect to the task of extracting a rule from observational data. This extraction of rules must not be conceived as a "drawing of conclusions" guided by logic. It is a constructive activity.

The salient topics in this practice are the selection of models and the description of the classifiers. We emphasized that the goal of modeling should not be conceived as the task of finding a single best model.

Methods like the Self-organizing Map, which are based on a sub-group segmentation of the data, can be used to create auto-consistent models, which also represent an optimally de-noised subset of the measured data. This data sample could be conceived as if it had been found by a factorial experimental design. Thus, auto-consistent models also provide quite valuable hints for the setup of the Taguchi method of quality assurance, which could be seen as a precipitation of organizational experience.

In the context of an exploratory investigation of observational data, one first has to determine the suitable observables (variables, predictors) and, by means of the same model(s), the suitable segment of observations before drawing domain-specific conclusions. Such conclusions are often expressed as contrasts in location or variation. In the context of designed experiments, as e.g. in pharmaceutical research, one first has to check the quality of the data, then to de-noise the data by removing outliers by means of the same data segmentation technique, before null hypotheses about expected contrasts can be tested.

As such, auto-consistent models provide a perfect basis for learning and for extending the “experience” of an epistemic individual. According to our proposals this experience does not suffer from the various problems of traditional Humean empirism (the induction problem), or contemporary (defective) theories of probabilism (mainly the problem of reference classes). Nevertheless, our approach remains fully empirico-epistemological.

Notes

1. Like many other philosophers, Lyotard emphasized the indisputability of an attention to the incidental, not as a perception-as, but as an aisthesis, a forming impression. See: Dieter Mersch, ›Geschieht es?‹ Ereignisdenken bei Derrida und Lyotard. Available online, last accessed May 1st, 2012. Another recent source arguing in the same direction is John McDowell's "Mind and World" (1996).

2. The label “representationalism” has been used by Dreyfus in his critique of symbolic AI, the thesis of the “computational mind” and any similar approach that assumes (1) that the meaning of symbols is given by their reference to objects, and (2) that this meaning is independent of actual thoughts, see also [2].

3. It would be inadequate to represent such a two-fold "almost" dichotomy as a 2-axis coordinate system, even if such a representation were a metaphorical one only; rather, it should be conceived as a tetrahedral space, given by two vectors passing nearby without intersecting each other. Additionally, the structure of that space must not be expected to be flat; it looks much more like an inhomogeneous hyperbolic space.

4. "Experiencibility" is here not understood as an individual capability to witness, or as receptivity, but as the abstract possibility to experience.

5. In the same way we reject Husserl's phenomenology. Phenomena, much like the objects of positivism or the thing-as-such of idealism, are not "out there"; they are results of our experiencibility. Of course, we do not deny that there is a materiality that is independent from our epistemic acts, but that does not explain or describe anything. In other words, we propose to go subjective (see also [3]).

6. Again, mechanism here should not be misunderstood as a single deterministic process as it could be represented by a (trivial) machine.

7. This question refers to the famous passage in the Tractatus that "The world is everything that is the case." Cases, in the terminology of the Tractatus, are facts as the existence of states of affairs. We may say, there are certain relations. In the Tractatus, Wittgenstein excluded relations that could not be explicated by the use of symbols, as expressed by the 7th proposition: "Whereof one cannot speak, thereof one must be silent."

8. We must step outside of language in order to see the working of language.

9. We just have to repeat it again, since many people develop misunderstandings here. We do not deny the material aspects of the world.

10. "Individual" is quite misleading here, since our brain and even our mind is not in-divisible in the atomistic sense.

11. Thus, it is also not reasonable to claim the existence of a somehow dualistic language, one part being without ambiguities and vagueness, the other one establishing ambiguity deliberately by means of metaphors. Lakoff & Johnson started from a similar idea, yet they developed it into a direction that is fundamentally incompatible with our views in many ways.

12. Of course, the borders are not well defined here.

13. “predictive power” could be operationalized in quite different ways, of course….

14. Correlational analysis is not a candidate to resolve this problem, since it can’t be used to segment the data or to identify groups in the data. Correlational analysis should be performed only subsequent to a segmentation of the data.

15. The so-called genetic algorithms are not algorithms in the narrow sense, since there is no well-defined stopping rule.

16. It is important to recognize that Artificial Neural Networks do NOT belong to the family of sub-group based methods.

17. Here another circle closes: the concept of causality can't be used in a meaningful way without considering its close amalgamation with the concept of information, as we argued here. For this reason, Judea Pearl's approach towards causality [17] is seriously defective, because he completely neglects the epistemic issue of information.

References
  • [1] Geoffrey C. Bowker, Susan Leigh Star. Sorting Things Out: Classification and Its Consequences. MIT Press, Boston 1999.
  • [2] Willian Croft, Esther J. Wood, Construal operations in linguistics and artificial intelligence. in: Liliana Albertazzi (ed.) , Meaning and Cognition. Benjamins Publ, Amsterdam 2000.
  • [3] Wilhelm Vossenkuhl. Solipsismus und Sprachkritik. Beiträge zu Wittgenstein. Parerga, Berlin 2009.
  • [4] Douglas Hofstadter, Fluid Concepts And Creative Analogies: Computer Models Of The Fundamental Mechanisms Of Thought. Basic Books, New York 1996.
  • [5] Nicholas F. Gier, Wittgenstein and Deconstruction, Review of Contemporary Philosophy 6 (2007); first publ. in Nov 1989. Online available.
  • [6] Henk L. Mulder, B.F.B. van de Velde-Schlick (eds.), Moritz Schlick, Philosophical Papers, Volume II: (1925-1936). Series: Vienna Circle Collection, Vol. 11b. Springer, Berlin / New York 1979. Available via Google Books.
  • [7] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee & Partha Niyogi (2004). General conditions for predictivity in learning theory. Nature 428, 419-422.
  • [8]  Vladimir Vapnik, The Nature of Statistical Learning Theory (Information Science and Statistics). Springer 2000.
  • [9] Herman J. Bierens (2006). Information Criteria and Model Selection. Lecture notes, mimeo, Pennsylvania State University. available online.
  • [10 ]Brian Weatherson (2007). The Bayesian and the Dogmatist. Aristotelian Society Vol.107, Issue 1pt2, 169–185. draft available online
  • [11] Edward I. George (2000). The Variable Selection Problem. J Am Stat Assoc, Vol. 95 (452), pp. 1304-1308. available online, as research paper.
  • [12] Alan Hájek (2007). The Reference Class Problem is Your Problem Too. Synthese 156(3): 563-585. draft available online.
  • [13] Lori E. Dodd, Margaret S. Pepe (2003). Partial AUC Estimation and Regression. Biometrics 59( 3), 614–623.
  • [14] Zytkov J. (1997). Knowledge=concepts: a harmful equation. 3rd Conference on Knowledge Discovery in Databases, Proceedings of KDD-97, p.104-109.AAAI Press.
  • [15] Thomas Kaufmann, Klaus Wassermann, Guido Schüpfer (2007).  Beta error free risk identification based on SPELA, a neuro-evolution method. presented at ESA 2007.
  • [16] Alan Hájek, “Interpretations of Probability”, The Stanford Encyclopedia of Philosophy (Summer 2012 Edition), Edward N. Zalta (ed.), available online, or forthcoming.
  • [17] Judea Pearl, Causality – Models, Reasoning, and Inference. 2nd ed. Cambridge University Press, Cambridge  (Mass.) 2008 [2000].

۞

Waves, Words and Images

April 7, 2012 § 1 Comment

The big question of philosophy, and probably its sole question, concerns the status of the human as a concept.1 Does language play a salient role in this concept, either as a major constituent, or as sort of a tool? Which other capabilities and which potential beyond language, if it is reasonable at all to take that perspective, could be regarded as similarly constitutive?

These questions may appear far off such topics like the technical challenges of programming a population of self-organizing maps, the limits of Turing-machines, or the generalization of models and their conditions. Yet, in times where lots of people are summoning the so-called singularity, the question about the status of the human is definitely not exotic at all. Notably, “singularity” is often defined as an “overwhelming intelligence”, seemingly coming up inevitably due to ever increasing calculation power, and which we could not “understand” any more. From an evolutionary perspective it makes very little sense to talk about singularities. Natural evolution, and cultural evolution alike, is full of singularities and void of singularities at the same time. The idea of “singularity” is not a fruitful way to approach the question of qualitative changes.

As you already may have read in another chapter, we prefer the concept of machine-based episteme as our ariadnic guide. In popular terms, machine-based episteme concerns the possibility for an actualization of a particular “machine” that would understand its own conditions when claiming “I know.” (Such an entity could not be regarded as a machine anymore, I guess.) Of course, in following this thread we meet a lot of already much-debated issues. Yet, moving the question about the episteme into the sphere of the machinic provides particular perspectives onto these issues.

In earlier times it has been tried, and some people still try today, to determine the status of the “human” as a sort of recipe. Do this and do that, but not that and this, and a particular quality will be established in your body, as your person, visible for others as virtue, labeled and conceived henceforth as the “quality of being human”. Accordingly, natural language with all its ambiguities need not be regarded as an essential pillar. Quite to the opposite: if the “human” could be defined as a recipe, then our everyday language has to be cleaned up, brought closer to crisp logic in order to avoid misunderstandings as far as possible; you may recognize this as the program of contemporary analytical philosophy. In methodological terms it was thought that it would be possible to determine the status of the human in positively given terms, or in short, in a positive definite manner.

Such positions are, quite fortunately so, now recognized more and more as highly problematic. The main reason is that it is not possible to justify any kind of determination in an absolute manner. Any justification requires assumptions, while unjustified assumptions are counter-pragmatic to the intended justification. The problematics of knowledge is linked in here, as it could not be regarded as “justified, true belief” any more.2 It was Charles S. Peirce who first concluded that the application of logic (as the grammar of reason) and ethics (as the theory of morality) are not independent from each other. In political terms, any positive definite determination that would be imposed on communities of other people must be regarded as an instance of violence. Hence, philosophy is no longer concerned about the status of the human as a fact; quite differently, the central question is how to speak about the status of the human, thereby not neglecting that speaking, that is using language, is not a private affair. This looking for the “how” has, of course, itself to obey the rule not to determine rules in a positive definite manner. As a consequence, the only philosophical work we can do is exploring the conditions, where the concept of “condition” refers to an open, though not recursive, chain. Actually, already Aristotle dubbed this “metaphysics” and regarded it as the core interest of philosophy. This “metaphysics” can’t be taken over by any “natural” discipline, whether a kind of science or engineering. There is a clear downstream relation: science as well as engineering should be affected by it in emphasizing the conditions for their work more intensely.

Practicing, turning the conditions and conditionability into facts and constraints, is the job of design, whether this design manifests as “design”, as architecture, as machine-creating technology, as politics, as education, as writing and art, etc. Philosophy not only can never explain, as Wittgenstein remarked; it also can’t describe things “as such”. Descriptions and explanations are only possible within a socially negotiated system of normative choices. This holds true even for the natural sciences. As a consequence, we should start with philosophical questions even in the natural sciences, and definitely always in engineering. And engaging in fields like machine learning, so-called artificial intelligence or robotics without constantly referring to philosophy will almost inevitably result in nonsense. The history of these fields is full of examples for that; just remember the infamous “General Problem Solver” of Simon and Newell.

Yet, the issue is not only one of ethics, morality and politics. It was Foucault who, in a sort of follow-up to Merleau-Ponty, first claimed a third region between the empiricism of affections and the tradition of reflecting on pure reason or consciousness.3 This third region, or even dimension (we would say “aspection”), based on the compound of perception and the body, comprises the historical evolution of systems of thinking. Foucault, together with Deleuze, once opened the possibility for a transcendental empiricism, the former mostly with regard to historical and structural issues of political power, the latter mostly with regard to the micronics of individual thought, where the “individual” is not bound to a single human person, of course. In our project as represented by this collection of essays we are following a similar path, starting with the transition from the material to the immaterial by means of association, and then investigating the dynamics of thinking in the aspectional space of transcendental conditions (forthcoming chapter), which builds an abstract bridge between Deleuze and Foucault as it covers both the individual and the societal aspects of thinking.

This Essay

This essay deals with the relation of words and a rather important aspect of thinking, representation. We will address some aspects of its problematics before we approach the role of words in language. Since representation is something symbolic in the widest sense, and since that representation has to be achieved autonomously by a mainly material arrangement, e.g. called “the machine”4, we also will deal (again) with the conditions for the transformation of (mainly) physical matter into (mainly) symbolic matter. Particularly, however, we will explore the role of words in language. The outline comprises the following sections: From Matter to Mind; The Unresolved Challenge; Names, proper: Performing the turn completely; Representing Words; Words, Classes, Models, Waves; Role of Words; Understanding (Images, Words, …); Ambiguity; “Intelligence”; Conclusions.

From Matter to Mind

Given the conditioning mentioned above, the anthropological history of the genus homo5 poses a puzzle. Our anatomical foundations6 have been stable for at least 60’000 years, but contemporary human beings at the age of, say, 20 or 30 years are surely much more “intelligent”7. Given the measurement scale established as I.Q. in the beginning of the 20th century, a significant increase can be observed for the surveyed populations even throughout the last 60 years.

So, what makes the difference then, between the earliest ancient cultures and the contemporary ones? This question is highly relevant for our considerations here, which focus on the possibility of a machine-based episteme, or in more standard, yet seriously misplaced terms, machine learning, machine intelligence or even artificial intelligence. In any of those fields, one could argue, researchers and engineers somehow start with mere matter, then imprint some rules and symbols onto that matter, only to expect that matter to become “intelligent” in the end. The structure of the problematics remains the same, whether we take the transition that started from paleo-cultures or the one rooted in the field of advanced computer science. Both instances concern the role of culture in the transformation of physical matter into symbolic matter.

While philosophy has tackled that issue for at least two and a half millennia, resulting in a rich landscape of arguments, including the reflection of the many styles of developing those arguments, computer science is still almost completely blind to the whole topic. Since computer scientists and computer engineers inevitably get into contact with the realm of the symbolic, they usually and naively repeat past positions, committing naive, i.e. non-reflective idealism or materialism that is not even on a pre-Socratic level. David Blair [6] correctly identifies the picture of language on which contemporary information retrieval systems are based as that of Augustine: he believed that every word has a meaning. Notably, Augustine lived in the late 4th till early 5th century A.D. This story simply demonstrates that in order to understand the work of a field one also has, as always, to understand its history. In the case of computer science it is the history of reflective thought itself.

Precisely this is also the reason why philosophy is much more than just a possibly interesting source for computer scientists. More directly expressed, it is probably one of the major structural faults of computer science that it is regarded as just a kind of engineering. Countless projects and pieces of software have failed because of such applied methodological reductionism. Everything that gets into contact with computers developed from within such an attitude then also becomes infected by the limited perspective of engineering.

One of the missing aspects is the philosophy of techno-science, which not just by chance seriously started with Heidegger8 as its first major proponent. Merleau-Ponty, inspired by Heidegger, then emphasized that everything concerning the human is artificial and natural at the same time. It does not make sense to set up that distinction for humans or man-made artifacts, as if such a difference were itself “natural”. Any such distinction refers more directly than not to Descartes as well as to Hegel, that is, it follows either simplistic materialism or overdone idealism, so to speak idealism in its machinic, Cartesian form. Indeed, many misunderstandings about the role of computers in contemporary science and engineering, but also in the philosophy of science and the philosophy of information, can be deciphered as a massive Cartesio-Hegelian heritage, with all its drawbacks. And there are many.

The most salient perhaps is the foundational element9 of Descartes’ as well as Hegel’s thoughts: independence. Of course, for both of them independence was a major incentive, goal and demand, for political reasons (absolutism in the European 17th century), but also for general reasons imposed by the level of techno-scientific insight, which remained quite low until the middle of the 20th century. People before the scientific age had been exposed to all sorts of threatening issues, concerning health, finances, religious or political freedom, collective or individual violence, often termed altogether “fate”. Being independent was a basic condition for living more or less safely at all, physically and/or mentally. Yet, Descartes and Hegel definitely exaggerated it.

Yet, the element of independence made its way into the core of the scientific method itself. Here it blossomed as reductionism, positivism and physicalism, all of which can be subsumed under the label of naive realism. It took decades until people developed some confidence not to prejudge complexity as esotericism.

With regard to computer science there is an important consequence. We first and safely can drop the label of “artificial intelligence” or “machine learning”, along with the respective narrow and limited concepts. Concerning machine learning we can state that only very few of the approaches that exist so far achieve even a rudimentary form of learning in the sense of structural self-transformation. The vast majority of approaches that are dubbed “machine learning” represent just some sort of advanced parameter estimation, where the parameters to be estimated are all defined (i) apriori, and (ii) by the programmer(s). And regarding intelligence we can recognize that we never can assign concepts like artificial or natural to it, since there is always a strong dependence on culture in it. Michel Serres once called written language the first artificial intelligence, pointing to the central issue of any technology: externalization of symbol-based systems of references.
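To make the point about parameter estimation concrete, consider a minimal sketch (the function names here are hypothetical): what is usually called “training” merely adjusts the values of parameters whose number and meaning the programmer has fixed in advance; the structure itself never changes.

```python
import numpy as np

# A minimal sketch of what is commonly sold as "machine learning":
# the programmer fixes the structure (a line) and the parameter set
# (slope, intercept) apriori; "learning" merely estimates their values.

def fit_line(x, y):
    # ordinary least squares for y = a*x + b
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + np.random.normal(0.0, 0.1, 50)
a, b = fit_line(x, y)
print(f"estimated parameters: a={a:.2f}, b={b:.2f}")
# No matter how much data arrives, the model can never transform its
# own structure -- it will always remain a line.
```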

This brings us back to our core issue here, the conditions for the transformation of (mainly) physical matter into (mainly) symbolic matter. In some important way we even can state that there is no matter without symbolic aspects. Two pieces of matter can interact only if they are not completely transparent to each other. If there is an effective transfer of energy between them, then the form of the energy becomes important; think of it for instance as the wavelength of some electromagnetic radiation, or its rhythmicity, which becomes distinctive in the case of a LASER [9,10]. Sure, in a LASER there are no symbols to be found; yet, the system as a whole establishes a well-defined and self-focusing classification, i.e. it performs the transition from a white-noised, real-valued randomness to a discrete intensional dynamics. The LASER thus has to be regarded as a particular kind of associative system, which is able to produce proto-symbols.

Of course, we may not restrict our considerations to such basic instances of pan-semiotics. When talking about machine-based episteme we talk about the ability of an entity to think about the conditions for its own informational dynamics (avoiding the term knowledge here…). Obviously, this requires some kind of language. The question for any attempt to make machines “intelligent” thus concerns in turn the question of how to think about the individual acquisition of language, and, with regard to our interests here, how to implement the conditions for it. Note that homo erectus, who lived 1 million years ago, must have had a clear picture not only about causality, and not only individually, but they also must have had the ability to talk about that, since they were able to keep fire burning and to utilize it for cooking meals. Logic had not been invented as a field at those times, but it seems absolutely mandatory that they were using a language.10 Even animals like cats, pigs or parrots are able to develop and to perform plans, i.e. to handle causality, albeit probably not in a conscious manner. Yet, neither wild pigs nor cats are capable of symbol-based culture, that is, a culture which spreads on the basis of symbols that are independent from a particular body or biological individual. The research programs of machine learning, robotics or artificial intelligence thus appear utterly naive, since they all neglect the cultural dimension.

The central set of questions thus considers the conditions that must be met in order to become able to deal with language, to learn it and to practice it.

These conditions are not only “private”; that is, they can’t be reduced to individual brains, or machines, that would “process” information. Leaving aside for the moment the simplistic perspective on information as it is usually practiced in computer science, we have to accept that learning language is a deeply social activity, even if the label of the material description of the entity is “computer”. We also have to think about the mediality of symbolic matter, the transition from nature to culture, that is, from contexts of low symbolic intensity to those of high symbolic intensity. Handling language is not an affair that could be thought to be performed privately; there is no such thing as a “private language”. Of course, we have brains, for which the matter could still be regarded as dominant, and the processes running there are running only there11.

Note that implementing the handling of words as apriori existing symbols is not what we are talking about here. As Hofstadter pointed out [12], calling the computing processes on apriori defined strings “language understanding” is nothing but silly. We are not allowed to call the shuffling of predefined encoded symbols back and forth “understanding”. But what could we call “understanding” then? Again, we have to postpone this question for the time being. Meanwhile we may reshape the question about learning language a bit:

How do we come to be able to assign names to things, classes, types, species, animals and other humans? What is the role of such naming, and what is the role of words?

The Unresolved Challenge

The big danger when addressing these issues is to start too late, provoked by an ontological stance applied to language. The most famous example is probably provided by Heidegger and his attempt at a “fundamental ontology”, which failed spectacularly. It is all too easy to get bewitched by language itself and to regard it as something natural, as something like stones: well-defined, stable, and potentially serving as a tool. Language itself makes us believe that words exist as such, independent from us.

Yet, language is a practice, as Wittgenstein said, and this practice is neither a single homogeneous one, nor does it remain constant throughout life, nor are its instances identical and exchangeable. The practice of language develops, unfolds, gains quasi-materiality, turns from an end to a means and back. Indeed, language may be characterized just by the capability to provide that variability in the domain of the symbolic. Take as a contrast for instance the symbolon, or the use of signs in animals; in both cases there is exactly one single “game” you can play. Only in such trivial cases could the meaning of a name be said to be close to its referent. Yet, language games are not trivial.

I already mentioned the implicit popularity of Augustine among computer scientists and information systems engineers. Let me cite the passage that Wittgenstein chose in his opening remarks to the famous Philosophical Investigations (PI)12. Augustine writes:

When they (my elders) named some object, and accordingly moved towards something, I saw this and I grasped that the thing was called by the sound they uttered when they meant to point it out. Their intention was shewn by their bodily movements, as it were the natural language of all peoples: the expression of the face, the play of the eyes, the movement of other parts of the body, and the tone of voice which expresses our state of mind in seeking, having, rejecting, or avoiding something. Thus, as I heard words repeatedly used in their proper places in various sentences, I gradually learnt to understand what objects they signified; and after I had trained my mouth to form these signs, I used them to express my own desires.

Wittgenstein gave two replies, one directly in the PI, the other one in the collection entitled “Philosophical Grammar” (PG).

These words, it seems to me, give us a particular picture of the essence of human language. It is this: the individual words in language name objects—sentences are combinations of such names.—In this picture of language we find the roots of the following idea: Every word has a meaning. This meaning is correlated with the word. It is the object for which the word stands.

Augustine does not speak of there being any difference between kinds of word. If you describe the learning of language in this way you are, I believe, thinking primarily of nouns like “table,” “chair,” “bread,” and of people’s names, and only secondarily of the names of certain actions and properties; and of the remaining kind of words as something that will take care of itself. (PI §1)

And in the Philosophical Grammar:

When Augustine talks about the learning of language he talks about how we attach names to things or understand the names of things. Naming here appears as the foundation, the be all and end all of language. (PG 56)

Before we take the step to drop and to drown the ontological stance once and for all we would like to provide two things. First, we will briefly cite a summarizing table from Blair [6]13. Blair’s book is indeed a quite nice work about the peculiarities of language as far as it concerns “information retrieval” and how Wittgenstein’s philosophy could be helpful in resolving the misunderstandings. Second, we will (also very briefly) make our perspective on names and naming explicit.

David Blair dedicates quite some effort to rendering the issue of the indeterminacy of language as clearly as possible. In alignment with Wittgenstein he emphasizes that indeterminacy in language is not the result of sloppy or irrational usage. Language is neither a medium of logic nor something like a projection screen of logic. There are good arguments, represented by the works of Ludwig Wittgenstein, the later Hilary Putnam and Robert Brandom, to believe that language is not an inferior way to express a logical predicate (see the previous chapter about language). Language can’t be “cleared” or made less ambiguous; its vagueness is a constitutive necessity for its use and utility in social intercourse. Many people in linguistics (e.g. Rooij [13]) and large parts of the cognitive sciences (e.g. Alvin Goldman [14]14), but also philosophers like Saul Kripke [16] or Scott Soames [17], take the opposite position.

Of course, in some contexts it is reasonable to try to limit the vagueness of natural language, e.g. in law and contracts. Yet, it is also clear that positivism in jurisprudence is a rather bad thing, especially if it shows up paired with idealism.

Blair then contrasts two areas in so-called “information retrieval”15, distinguished by the type of data that is addressed: on the one hand structured data that could be arranged in tables, which Blair calls determinate data, and on the other hand such “data” that can’t be structured apriori, like language. We already met this fundamental difference in other chapters (about analogies, language). Blair summarized the result of his investigation in a comparative table. It is more than obvious that the characteristics of the two fields are drastically different, which equally obviously has to be reflected in the methods to be applied. For instance, the infamous n-gram method is definitely a no-go.
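To make concrete what is rejected here, consider a minimal sketch of the n-gram method (a toy, not any particular system): a text is reduced to counted adjacencies of tokens, and everything that would require interpretation is discarded.

```python
from collections import Counter

def ngrams(text, n=2):
    # the "infamous" method: split the text into tokens and count
    # adjacent runs of n tokens, discarding all further structure
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(ngrams("the cat sat on the mat", 2))
# Counter({('the', 'cat'): 1, ('cat', 'sat'): 1, ('sat', 'on'): 1, ...})
# Whatever "meaning" or "sentiment" is read off such counts, no
# interpretation has taken place -- only counting.
```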

For the same reasons, semantic disambiguation is not possible by a set of rules that could be applied by an individual, whether this individual is a human or a machine. Quite likely it is even completely devoid of sense to try to remove ambiguity from language. One of the reasons is given by the fact that concepts are transcendental entities. We will return to the issue of “ambiguity” later.

In the quote from the PG shown above Wittgenstein rejects Augustine’s perspective that naming is central to language. Nevertheless, there is a renewed discussion in philosophy about names and so-called “natural kind terms”, brought up by Kripke’s “Naming and Necessity” [16]. Recently, Scott Soames explicitly referred to Kripke’s work. Yet, like so many others, Soames commits the drastic mistake introduced along the line formed by Frege, Russell and Carnap in ascribing language the property of predicativity (cf. [18], p. 646).

These claims are developed within a broader theory which, details aside, identifies the meaning of a non-indexical sentence S with a proposition asserted by utterances of S in all normal contexts.

We won’t delve in any detail into the discussion of “proper names”16, because it is largely a misguided and unnecessary one. Let me just briefly mention the three main (and popular) alternative approaches to the meaning of names: the descriptivist theories, the referential theory originally advanced by John Stuart Mill, and the causal-historical theory. None of them is tenable, because they all implicitly violate the primacy of interpretation, though not in an obvious manner.

Why can’t we say that a name is a description? A description needs assignates17, or aspects, if you like, at least one scale. Assuming that there is the possibility for a description that is apriori justified and hence objective invokes divinity as a hidden parameter, or some other kind of Fregean hyper-idealism. Assignates are chosen according to and in dependence on the context. Of course, one could try to expel any variability of any expectable context, e.g. by literally programming society, or by some kind of philosophical dictatorship. In any other case, descriptions are variant. The actual choice for any kind of description is the rather volatile result of negotiation processes in the embedding society. The rejection of names as descriptions results from the contradictory pragmatic stances: first, names are taken as indivisible, atomic entities, but second, descriptions are context-dependent subatomic properties; this contradiction, by virtue of the implied pragmatics, corroborates the primary claim. Remember that the context-dependency results from the empirical underdetermination. In standard situations it is neither important that water is a compound of hydrogen and oxygen, nor is this what we want to say in everyday situations. We do not carry the full description of the named entity along into every instance of its use, even though there are some situations where we indeed are interested in the description, e.g. as a scientist, or as a supporter of the “hydrogen economy”. The important point is that we never can determine the status of the name before we have interpreted the whole sentence, while we also can’t interpret the sentence without determining the status of the named entity. Both entities co-emerge. Hence we also can’t give an explicit rule for such a decision other than just using the name or uttering the sentence. Wittgenstein thus denies the view that assumes a meaning behind the words that is different from their usage.

The claim that the meaning of a proper name is its referent meets similar problems, because it just introduces the ontological stance through the backdoor. Identifying the meaning of a label with its referent implies that the meaning is taken as something objective, as something that is independent from context, and even beyond that, as something that could be packaged and transferred *as such*. In other words, it deliberately denies the primacy of interpretation. We need not say anything further, except perhaps that Kripke (and Soames as well, in taking it seriously) commits a third mistake in using “truth-values” as factual qualities.18 We may propose that the whole theory of proper names follows a pseudo-problem, induced by overgeneralized idealism or materialism.

Names, proper: Performing the turn completely

Yet, what would be an appropriate perspective to deal with the problem of names? What I would like to propose is a consequent application of the concept of the “language game”. The “game” perspective can be applied not only to the complete stream of exchanged utterances, but also to the parts of sentences, e.g. names and single words. As a result, new questions become visible. Wittgenstein himself did not explore this possibility (he took Augustine as a point of departure), and it can not be found in contemporary discourse either19. As so often, philosophers influenced by positivism simply forget about the fact that they are speaking. Our proposal is markedly different from and also much more powerful than the causal-historical or the descriptivist approach, and it also avoids the difficulties of Kripke’s externalist version.

After all, naming, to give a name and to use names, is a “language game”. Names are close to observable things, and as a matter of fact, observable things are also demonstrable. Using a name refers to the possibility of a speaker to provide a description to his partner in discourse such that this listener would be able to agree on the individuality of the referenced thing. The use of the name “water” for this particular liquid thing does not refer to an apriori fixed catalog of properties. Speaker and listener need not even agree on the identity of the set of properties ascribed to the referred physical thing. The chemist may always associate the physico-chemical properties of the molecule even when he reads about the submersed sailors in Shakespeare’s *Tempest*, but nevertheless he easily could talk about that liquid matter with a 9-year-old boy who knows neither about Shakespeare nor about the molecule.

It is thus neither possible nor reasonable to try to achieve a match regarding the properties, since a rich body of methods would necessarily be invoked to determine that set. Establishing the identity of representations of physical, external things, or even of the physical things themselves, inevitably invokes a normative act (which is rather incommensurable with the empiricists’ claims).

For instance, when saying just “London”, out of the blue, it is not necessary that we envisage the same aspects of the grand urban area. Since cities are inevitably heterotopic entities (in the sense of Foucault [19, 20], acc. to David Graham Shane [21]), this agreement is actually impossible. Even for the undeniably more simple-minded cartographers the same problem exists: “where” is that London, in terms of spherical coordinates? Despite these unavoidable difficulties both the speaker and the listener easily agree on the individuality of the imaginary entity “London”. The name “London” does not point to a physical thing but just to an imaginative pole. In contrast to concepts, however, names take a different grammatical role, as they not only allow for a negotiation of rather primitive assignates in order to take action, they even demonstrate the possibility of such negotiation. The actual negotiations could be quite hard, though.

We conclude that we are not allowed to take any of the words as something that would “exist” as, or like, a physical “thing”. Of course, we get used to certain words; they gain a quasi-materiality because a constancy appears that may be much stronger than the initial contingency. But this “getting used” is a different topic; it just refers to how we speak about words. Naming remains a game, and like any other game this one also does not have an identifiable border.

Despite this manifold that is mediated through language, or as language, it is also clear that language remains rooted in activity or the possibility of it. I demonstrate the usage of a glass and accompany that by uttering “glass”. Of course, there is the Gavagai problematics20 as it has been devised by Quine [22]. Yet, this problematics is not a real problem, since we usually interact repeatedly. On the one hand this provides us with the possibility to improve our capability to differentiate single concepts in a certain manner, but on the other hand the extended experience introduces a secondary indeterminacy.

In some way, all words are names. All words may be taken as indicators that there is the potential to say more about them, yet in a different, orthogonal story. This holds even for the abstract concepts denoted by the word “transcendental” or for verbs.

The usage of names, i.e. their application in the stream of sentences, gets richer and richer, but also more and more indeterminate. All languages developed some kind of grammar, which is a more or less strict body of rules about how to arrange words for certain language games. Yet, grammar is not a necessity for language at all; it is just a tool to render language-based communication easier, faster and more precise. Beyond the grammars, it is experience which enables us to use metaphors in a dedicated way. Yet, language is not a thing that sometimes contains metaphors and sometimes not. In a very basic sense all of language is metaphorical all the time.

So, we first conclude that there is nothing enigmatic in learning a language. Secondly, we can say that extending the “gameness” down to words provides the perspective of the mechanism, notably without reducing language to names or propositions.

Instead, we now can clearly see how these mechanisms mediate between the language game as a whole, the metaphorical characteristics of any language and simple rule-based mechanisms.

Representing Words

There is a drastic consequence of the completed gaming perspective. Words can’t be “represented” as symbols or as symbolic strings in the brain, and words can’t be appropriately represented as symbols in the computer either. Given any programming language, strings in a computer program are nothing else than particularly formatted series of values. Usually, this series is represented as an array of values, which is part of an object. In other words, the word is represented as a property of an object, where such objects are instances of their respective classes. Thus, the representation of words in ANY computer program created so far for the purpose of handling texts, documents, or textual information in general is deeply inappropriate.

Instead, the representation of the word has to carry along its roots, its path of derivation, or in still other words, its traces of precipitation of the “showing”. This rooting includes, so we may say, a demonstrativum, an abstract image. This does not mean that we have to set up an object in the computer program that contains a string and an abstract image. That would be just the positivistic approach, leaving all problems untouched, with the string and the image still being independent; the question of how to link them would just be delegated to the next analytic homunculus.

What we propose are non-representational abstract compounds that are irrevocably multi-modal since they are built from the assignates of abstract “things” (Gegenstände). These compounds are nothing else than combined sets of assignates. The “things” represented in this way are actually always more or less “abstract”. Through the sets of assignates we actually may combine even things which appear incommensurable on the level of their wholeness, at least at first sight. An action is an action, not a word, and vice versa; an image is neither a word nor an action, isn’t it? Well, it depends; we already mentioned that we should not take words as ontological instances. Any of those entities can be described using the same formal structure, the probabilistic context, which is further translated into a set of assignates. The probabilistic context creates a space of expressibility, where the incommensurability disappears, notably without reducing the comprised parts (image, text, …) to the slightest extent.
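A minimal sketch may illustrate the idea, under strong assumptions (the extraction functions and feature labels below are hypothetical stand-ins, not our actual method): a word and an image are each dissolved into weighted assignates, and the compound is nothing but the merged set within one shared space.

```python
# A toy "synpresentational" compound (all names are hypothetical
# stand-ins): every modality is dissolved into weighted assignates,
# and the compound is just the merged set -- no apriori link between
# the string "glass" and any image is ever stored.

def word_assignates(word, corpus):
    # probabilistic context: relative co-occurrence frequencies
    counts = {}
    for sentence in corpus:
        tokens = sentence.lower().split()
        if word in tokens:
            for t in tokens:
                if t != word:
                    counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values()) or 1
    return {f"ctx:{t}": c / total for t, c in counts.items()}

def image_assignates(pixels):
    # stand-ins for arbitrary extraction methods, each yielding assignates
    flat = [p for row in pixels for p in row]
    return {"img:brightness": sum(flat) / len(flat),
            "img:contrast": max(flat) - min(flat)}

corpus = ["the glass is on the table", "she filled the glass with water"]
compound = {**word_assignates("glass", corpus),
            **image_assignates([[0.2, 0.8], [0.4, 0.9]])}
print(compound)
```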

The situation is a bit reminiscent of synesthetic experiences. Yet, I would like to avoid calling it synesthetic, since synesthesia is experienced on a highly symbolic level. Like other phenomenological concepts, it also does not provide any hint about the underlying mechanisms. In contrast, we are talking about a much lower level of integration. Probably we could call this multi-modal compound a “syn-presentational” compound, or in short, a “synpresentation”.21

Words, images and actions are represented together as a quite particular compound, an inextricable multi-modal compound. We also may say that these compounds are derived qualia. The exciting point is that the described way of probabilistic multi-modal representation obviates the need for explicit references and relations between words and images. Such relations would otherwise have to be defined apriori (strongly: before programming; weakly: before usage). In our approach, and quite in contrast to the model of external control, relations and references *can be* subject to context-dependent alignments, either to the discourse, or to the task (of preparing a deliverable from memory).

The demonstrativum may not only refer to an “image”. First note that the image does not exist outside of its interpretation. We need to refer to that interpretation, not to an index in a database or a file system. Interpretation thus means that we apply a lot of various processing and extraction methods to it, each of them providing a few assignates. The image is dissolved into probabilistic contexts, as we do it for words and have described elsewhere. The dissolving of an image is of course not the endpoint of a communicable interpretation; it is just the starting point. Yet, this does not matter, since the demonstrativum may also refer to any derived intension and even to any derived concept.22

The probabilistic multi-modal representation exhibits three highly interesting properties, concerning abstractness, relations and the issue of foundations. First, the abstractness of represented items becomes scalable in an almost smooth manner; in our approach, “abstractness” is not a quality any more. Second, relations and references of both words and the “content” of images are transformed into their pre-specific versions. Relations and references need not be implemented apriori or observed as an apriori; initially, they appear only as randolations23. Third, some derived and already quite abstract entities on an intermediate level of “processing” are more basic than the so-called raw observations24.

Words, Classes, Models, Waves

It is somewhat tempting to arrange these four concepts into a hierarchical series. Yet, things are not that simple. Actually, any of the concepts that appear more as a symbolistic entity may also re-turn into a quasi-materiality, into a wave-like phenomenon that itself serves as a basis for potential differences. This re-turn is a direct consequence of the inextricable mediality of the world, mediality understood here as a transcendental category. Needless to say, mediality is just another blind spot in contemporary computer science. Cybernetics as well as engineering straightaway exclude the possibility of recognizing the mediatedness of worldly events.

In this section we will try to explicate the relations between the headlined concepts to some extent, at least as far as it concerns the mapping of those concepts into an implementable system of (non-Turing) “computer programs”. The computational model that we presuppose here is the extended version of the 2-layered SOM, as we introduced it previously.
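For readers who have not met the SOM before, a bare-bones single-layer sketch may help; this is only the elementary Kohonen map, not the extended 2-layered architecture presupposed here, and all sizes and rates are arbitrary choices.

```python
import numpy as np

# A bare-bones self-organizing map (Kohonen map): a grid of nodes,
# each holding a weight vector, is pulled towards the input data so
# that neighboring nodes come to respond to similar inputs.

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 8, 8, 3
weights = rng.random((grid_w, grid_h, dim))
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                              indexing="ij"), axis=-1)

def train(data, epochs=20, lr0=0.5, radius0=4.0):
    global weights
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                 # decaying learning rate
        radius = radius0 * (1 - epoch / epochs) + 0.5   # shrinking neighborhood
        for x in data:
            # best matching unit: the node whose weights are closest to x
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # pull the BMU and its grid neighbors towards x
            dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-dist2 / (2 * radius ** 2))[..., None]
            weights += lr * h * (x - weights)

train(rng.random((200, dim)))
print("weight grid after training:", weights.shape)
```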

Let us start with first things first. Given a physical signal, here in the literal sense, that is, as a potentially perceivable difference in a stream of energy, we find embodied modeling, and nothing else. The embodiment of the initial modeling is actualized in sensory organs, or more generally, in any instance that is able to discretize the waves and differences at least “a bit more”. In more technical terms, the process of discretization is a process that increases the signal-to-noise ratio. In biological systems we often find a frequency encoding of the intensity of a difference. Though the embodiment of that modeling is indeed a filtering and encoding, hence already some kind of a modeling representation, it is not a modeling in the narrower sense. It points out of the individual entity into the phylogenesis, the historical contingency of the production of that very individual entity. We also can’t say that the initial embodied processing by the sensory organs is a kind of encoding. There is no code consisting of well-identified symbols at the proximate end of the sensory cell. It is still a rather probabilistic affair.
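The frequency encoding just mentioned can be illustrated by a toy sketch (the numbers are invented and carry no physiological accuracy): the intensity of a difference is mapped onto the rate of discrete, all-or-nothing spikes.

```python
import numpy as np

# Toy frequency encoding: a noisy analog intensity is not passed on
# as an analog value but as the rate of discrete, all-or-nothing
# spikes. All numbers here are illustrative only.

def encode(intensity, duration=1.0, max_rate=100.0, dt=0.001):
    rate = max_rate * np.clip(intensity, 0.0, 1.0)   # spikes per second
    t = np.arange(0.0, duration, dt)
    # Poisson-like spike train: each time bin fires with probability rate*dt
    return (np.random.random(t.shape) < rate * dt).astype(int)

analog = 0.6 + np.random.normal(0.0, 0.2)            # a noisy analog difference
spikes = encode(analog)
print(f"analog input {analog:.2f} -> {spikes.sum()} spikes per second")
# The spike train is already "a bit more" discrete than the wave,
# yet still a probabilistic affair -- no symbol in sight.
```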

This basic encoding is not yet symbolic, albeit we also can’t call it a wave any more. In biological entities this slightly discretized wave is then subject to an intense modeling sensu stricto. The processing of the signals is performed by associative mechanisms that are arranged in cascades. This “cascading” is highly interesting and probably one of the major mandatory ingredients that have been neglected by computer science so far. The reason is quite clear: it is not an analytic process, hence it is excluded from computer science almost by definition.

Throughout that cascade signals turn more and more into information as interpreted difference. It is clear that there is not a single or identifiable point in this cascade to which one could assign the turn from “data” to “information”. The process of interpretation is, quite in contrast to idealistic pictures of the process of thinking, not a single step. The discretized waves that flow into the processing cascade are subject to many instances and very different kinds of modeling, throughout which discrete pieces get separated and related to other pieces. The processing cascade thus repeats a modular principle consisting of association and distribution.

This level we still could not label as “thinking”, albeit it is clearly some kind of a mental process. Yet, we could still regard it as something “mechanical”, even as we also find already class-like representations, intensions and proto-concepts. Thinking in its meaningful dimension, however, appears only through assigning sharable symbols. Thinking of something implicitly means that one could tell about the respective thoughts. Whether these symbols are shared between different regions in the brain or between different bodily entities does not matter much. Hence, thinking and mental processes need to be clearly distinguished. Yet, assigning symbols, that is assigning a word, a specific sound first, and later, as a further step of externalization, a specific grapheme that reflects the specific sound, which in turn represents an abstract symbol, this process of assigning symbols is only possible through cultural means. Cats may recognize situations very well and react accordingly, they may even have a feeling that they have encountered that situation before, but cats can’t share their symbols; they can’t communicate the relational structure of a situation. Yet, cats and dogs already may take part in “behavior games”, and such games clearly have been found in baboons by Fernando Colmenares [24]. Colmenares adopted the concept of “games” precisely because of the co-occurrence of obvious rules, high variability, and predictive values of actions and reactions of the individual animals. Such games unfold synchronically as well as diachronically, and across dynamically changing assignments of social roles. All of this is accompanied by specific sounds. Other instances of language-like externalization of symbols can presumably be found in grey parrots [25], green vervet monkeys [26], bonobos, dolphins and orcas.

But still… in animals those already rather specific symbols are not externalized by imprinting them into matter different from their own bodies. One of the most desirable capabilities for our endeavor here about machine-based episteme thus consists in just such externalization processes embedded in social contexts.

Now the important thing to understand is that this whole process from waves to words is not simply a one-way track. First, words do not exist as such; they just appear as discrete entities through usage. It is the usage that introduces irreversibility. In other words, the discreteness of words is a quality that is completely on the aposteriori side of thinking. Before their actual usage, their arrangement into sentences, words “are” nothing else than probabilistic relations. It needs a purpose, a target-oriented selection (call it “goal-directed modeling”) to let them appear as crisp entities.

The second issue is that a sentence is an empirical phenomenon, remarkably even to the authoring brain itself. The sentence needs interpretation, because it is never ever fully determinate. Interpretation, however, of such indeterminate instances like sentences renders the apparently crisp phenomenon of words back into waves. A further effect of the interpretation of sentences as series of symbols is the construction of a virtual network. Texts, and in a very similar way, pieces of music, should not be conceived as series, as computational linguistics treats them. Much more appropriately, texts are conceived as networks that even may exert their own (again virtual) associative power, which to some extent is independent from the hosting interpreter, as I have argued here [28].
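As a minimal sketch of this network view (a toy co-occurrence graph; window size and weighting are arbitrary assumptions, not the method argued for in [28]): instead of keeping the text as a series, words become nodes and their co-occurrences become weighted edges.

```python
from collections import defaultdict
from itertools import combinations

# A toy rendering of a text as a network instead of a series: words
# become nodes, co-occurrence within a sliding window becomes a
# weighted edge. Window size and weighting are arbitrary choices.

def text_to_network(text, window=3):
    tokens = text.lower().split()
    edges = defaultdict(int)
    for i in range(len(tokens)):
        for a, b in combinations(tokens[i:i + window], 2):
            if a != b:
                edges[tuple(sorted((a, b)))] += 1
    return edges

net = text_to_network("the glass is on the table and the table holds the glass")
for (a, b), w in sorted(net.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{a} -- {b}: {w}")
```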

Role of Words

All these characteristics of words, their purely aposteriori crispness, their indeterminacy as sub-sentential indicators of randolational networks, their quality as signs by which they only point to other signs, but never to “objects”, their double quality as constituent and result of the “naming game”, all these “properties” make it appear highly unlikely and questionable whether language is about references at all. Additionally, we know that the concept of “direct” access to the mind or the brain is simply absurd. Everything we know about the world as individuals is due to modeling and interpretation. That of course also concerns the interpretation of cultural artifacts or the culturally enabled externalization of symbols, for instance into the graphemes that we use to represent words.

It is of utmost importance to understand that the written or drawn grapheme is not the “word” itself. The concept of a “word-as-such” is highly inappropriate, if not bare nonsense.

So, if words, sentences and language at large are not about “direct” referencing of (quasi-)material objects, how then should we conceive of the process we call “language game”, or “naming game”? Note that we now can identify van Fraassen’s question about “how do words and concepts acquire their reference?” as a misunderstanding, deeply informed by positivism itself. It does not make sense to pose that question in this way at all. There is not first a word which then, in a secondary process, gets some reference or meaning attached. Such a concept is almost absurd. Similarly, the distinction between syntax and semantics, once introduced by the positivist Morris in the late 1940s, is to be regarded as much the same pseudo-problem, established just by the fundamental and elemental assumptions of positivism itself: linear additivity, metaphysical independence and lossless separability of parts of wholenesses. If you scatter everything into single pieces of empirical dust, you will never be able to make any proposition anymore about the relations you destroyed before. That’s the actual reason for the problem of positivistic science and its failure.

In contrast to that we tend to propose a radically different picture of language, one that of course has been existing in many preformed flavors. Since we can’t transfer anything directly into another’s mind, the only thing we can do is to invite or trigger processes of interpretation. In the chapter about vagueness we called words “processual indicatives” for slightly different reasons. Language is a highly structured, institutionalized and symbolized “demonstrating”, an invitation to interpret. Robert Brandom investigated in great detail [29] the processes and the roles of speakers and listeners in that process of mutual invitation for interpretation. The mutuality allows a synchronization, a resonance and a more or less strong resemblance between pairs of speaker-listeners and listener-speakers.

The “naming game” and its derivative, the “word game”, are embedded into a context of “language games”. Actually, word games and language games are not as related as it might appear prima facie, at least beyond their common characteristic that we may label “game”. This becomes apparent if we ask what happens with the “physical” representative of a single word that we throw into our mechanisms. If there is no sentential context, or likewise no social context like a chat, then a lot of quite different variants of possible continuations are triggered. Calling out “London”, our colleague in chatting may continue with “Jack London” (the writer), “Jack the Ripper”, Chelsea, London Tower, Buckingham, London Heathrow, London Soho, London Stock Exchange, etc., but also Paris, Vienna, Berlin, etc., the choices being slightly dependent on our mood, the thoughts we had before, etc. In other words, the word that we bring to the foreground as a crisp entity behaves like a seedling: it is the starting point of a potential garden or forest, it functions as the root of the unfolding of a potential story (as a co-weaving of a network of abstract relations). Just to bring in another metaphorical representation: words are like the initial traces of firework rockets, or the traces of elementary particles in statu nascendi as they can be observed in a bubble chamber: they promise a rich texture of upcoming events.
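As a toy rendering of this seedling behavior (the association table and its weights are invented purely for illustration): a bare word, thrown in without sentential context, fans out into many weighted continuations.

```python
import random

# A bare word without context behaves like a seedling: it triggers a
# whole fan of weighted continuations. The table and the weights are
# invented purely for illustration.

associations = {
    "london": [("jack london", 3), ("jack the ripper", 2), ("london tower", 2),
               ("heathrow", 2), ("soho", 1), ("paris", 1), ("vienna", 1)],
}

def continuations(word, k=3, mood=None):
    candidates = associations.get(word.lower(), [])
    if not candidates:
        return []
    rng = random.Random(mood)              # "mood" biases the unfolding
    names = [c for c, _ in candidates]
    weights = [w for _, w in candidates]
    return [rng.choices(names, weights=weights)[0] for _ in range(k)]

print(continuations("London", mood=42))
```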

Understanding (Images, Words, …)

We have seen that “words” gain shape only as a result of a particular game, the “naming game”, which is embedded into a “language game”. Before those games are played, “words” do not exist as discrete, crisp entities, say as a symbol, or a string of letters. If they did, we could not think. Even more than the “language game”, the “naming game” works mainly as an invitation or as an acknowledged trigger for more or less constrained interpretation.

Now there are those enlightened language games of “understanding” and “explaining”. Both of them work just as any other part of speech does: they promise something. The claim to understand something refers to the ability for a potential preparation of a series of triggers, which one additionally claims to be able to arrange in such a way as to support the gaining of the respective insight in one’s chat partner. Slightly derived from that, understanding also could mean to transfer the structure of the underlying or overarching problematics to other contexts. This ability for adaptive reframing of a problematic setting is thus always accompanied by a demonstrativum, that is, by some abstract image, either by actual pictorial information or its imagination, or by its activity. Such a demonstrativum could be located completely within language itself, of course, which however is probably quite rare.

Ambiguity

It is clear that language does not work as a way to express logical predicates. Trying to do so needs careful preparations. Language can’t be “cured” of or “cleaned” from ambiguities; trying to do so would establish a categorical misunderstanding. Any “disambiguation” happens as a resonating resemblance of at least two participants in language-word-gaming, mutually interpreting each other until both believe that their interests and their feelings match. An actual, so to speak objective, match is neither necessary nor possible. In other words, language does not exist in two different forms, one without ambiguity and without metaphors, and the other full of them. Language without metaphorical dynamics is not a language at all.

The interpretation of empirical phenomena, whether outside of language or concerning language itself, is never fully determinable. Quine called the idea of the possibility of such a complete determination a myth, a “dogma of empiricism” [30]. Thus, given this underdetermination, it does not make any sense to expect that language should be isomorphic to logical predicates or propositions. Language is basically an instance of impredicativity. Elsewhere we already met the self-referentiality of language (its strong singularity) as another reason for this. Instead, we should expect that this fundamental empirical underdetermination is reflected appropriately in the structure of language, namely as analogical thinking, or quite related to that, as metaphorical thinking.

Ambiguity is not a property of language or words; it is a result, or better, a property of the process of interpretation at some arbitrarily chosen point in time. And that process takes place synchronously within a single brain/mind as well as between two brains/minds. Language is just the mediating instance of that intercourse.

“Intelligence”

It is now possible to clarify the ominous concept of “intelligence”. We find the concept in the name of a whole discipline (“Artificial Intelligence”), and it is at work behind the scenes in areas dubbed “machine learning”. Then there is the hype about the so-called “collective intelligence”. These observations, and of course our own intentions, make it necessary to deal briefly with it, albeit we think that it is a misleading and inappropriate idea.

First of all one has to understand that “intelligence” is an operationalization of a research question, allowing for a measurement, hence for a quantitative comparison. It is questionable whether mental qualities can be made quantitatively measurable without reducing them seriously. For instance, the capacity for I/O operations related to a particular task surely can’t be equated with “intelligence”, even if it could be a necessary condition.

It is just silly to search for “intelligence” in machines or beings, or to assign more or less intelligence to any kind of entity. Intelligence as such does not “exist” independently of a cultural setup; we can’t find it “out there”. Ontology is, as always, not only a bad trail, it directly leads into the abyss of nonsense. The research question, by the way, was induced by the intention to prove that black people and women are less intelligent than white males.

Yet, even if we take “intelligence” in an adapted and updated form as the capability for autonomous generalization, it is a bad concept, simply because it does not allow us to pose further reasonable questions. This follows directly from its characteristic of being itself an operationalization. Investigating the operationalization hardly brings anything useful to light about the pretended subject of interest.

The concept of intelligence arose in a strongly positivistic climate, where positivism was practiced even in a completely unreflected manner. Hence, its inventors have not been aware of the effect of their operationalization. The concept of intelligence implies a strong functional embedding of the respective, measured entity. Yet, dealing with language undeniably has something to do with higher mental abilities, but language is a strictly non-functional phenomenon. It does not matter here that positivists still claim the opposite. And who would stand up claiming that a particular move, e.g. in planning a city, or in dealing with the earth’s climate, is smarter than another? In other words, the other strong assumption of positivism, measurability and identifiability, also fails dramatically when it comes to human affairs. And everything on this earth is a human affair.

Intelligence is only determinable relative to a particular Lebensform. It is thus not possible to “compare intelligence” across individuals living in different contexts. This finally renders the concept completely useless.

Conclusions

The hypothesis I have been arguing for in this essay claims that the trinity of waves, words and images plays a significant role in the ability to deal with language and for the emergence of higher mental abilities. I proposed first that this trinity is irreducible, and second that it is responsible for this ability in the sense of a necessary and sufficient condition. In order to describe the practicing of that trinity, for instance with regard to possible implementations, I introduced the term “synpresentation”. This concept draws the future track of how to deal with words and images as far as machine-based episteme is concerned.

In more direct terms, we conclude that without the capability to deal with “names”, “words” and language, the attempt to map higher mental capacities onto machines will not experience any progress. Once a machine has arrived at such a level, it will find itself in exactly the same position as we humans do. This capability is definitely not sufficiently defined by “calculation power”; indeed, such an idea is ridiculous. Without an embedding into appropriate social intercourse, and without solving the question of representation (contemporary computer science and its technology do NOT solve it, of course), even a combined 10^20000 flops will not render the respective machine or network of machines (note 25) “intelligent” in any way.

Words and proper names have been re-formulated as a particular form of “games”, though not as “language games”, but on a more elementary level as “naming games”. I have tried to argue how the problematics of reference could be thought to dissolve as a pseudo-problem on the basis of such a reformulation.

Finally, we found important relationships to earlier discussions of concepts like the making of analogies or vagueness. We basically agree with the stance that language can’t be clarified and that it is inappropriate (“free of sense”) to assign any kind of predicativity to language. Bluntly put, the application of logic is in the mind, and nowhere else. Communicating about this application is not based on a language any more, and similarly, projecting logic onto language destroys language. The idea of a scientific language is as empty as the idea of a generally applicable and understandable language. A language that is not inventive could not be called such.

Notes

1. If you read other articles in this blog you might think that there is a certain redundancy in the arguments and the targeted issues. This is not the case, of course. The perspectives are always a bit different; thus I hope that by the repeated attempt “to draw the face” the problematics is rendered more accurately. “How can one learn the truth by thinking? As one learns to see a face better if one draws it.” (Ludwig Wittgenstein, Zettel §255, [1])

2. In one of the shortest articles ever published in the field of philosophy, Edmund Gettier [2] demonstrated that it is deeply inappropriate to conceive of knowledge as “justified true belief”. Yet, in the field of machine learning, so-called “belief revision” still follows precisely this untenable position. See also our chapter about the role of logic.

3. Michel Foucault, “Dits et Ecrits” I 846 (dt. 1075) [3], cited after Bernhard Waldenfels [4] p.125.

4. We will see that the distinction or even separation of the “symbolic” and the “material” is neither that clear nor is it simple. From the side of the machine, Felix Guattari argued in favor of a particular quality [5], the machinic, which is roughly something like a mechanism in human affairs. From the side of the symbolic there is clearly the work of Edwina Taborsky to cite, who extended and deepened the work of Charles S. Peirce in the field of semiotics.

5. Particularly homo erectus and homo sapiens spec.

6. Humans of the species homo sapiens sapiens.

7. For the time being we leave this ominous term “intelligence” untouched, but I also will warn you about its highly problematic state. We will resolve this issue by the end of this essay.

8. Heidegger developed the figure of the “Gestell” (cf. [7]), which serves multiple purposes. It provides a storage capacity, it is a tool for a sort of well-ordered, organized hiding and unhiding (“entbergen”), it provides a scaffold for sorting things in and out, and thus it works as a complex constraint on technological progress. See also Peter Sloterdijk on this topic [8].

9. Elementarization regarding Descartes.

10. Homo floresiensis, also called “hobbit man”, who lived on Flores, Indonesia, from 600’000y until approx. 3’000y ago. Homo floresiensis derived from homo erectus. 600’000 years ago they obviously built a boat to cross over to the island, across a strait with strong currents. The interesting issue is that this endeavor requires a stable social structure, division of labor, and thus also language. Homo floresiensis had a particular forebrain anatomy which is believed to have provided the “intelligence”, while the overall brain was relatively small as compared to ours.

11. Concerning “the enigma of brain-mind interaction”, Eccles was an avowed dualist [11]. Consequently he searched for the “interface” between the mind and the brain, in which he was deeply inspired by the 3-world concept of Karl Popper. The “dualist” position held that the mind exists at least partially independently from and somehow outside the brain. Irrespective of his contributions to neuroscience on the cellular level, these ideas (of Eccles and Popper) are just wild nonsense.

12. The Philosophical Investigations are probably the most important contribution to philosophy in the 20th century. They are often mistaken as a foundational document for analytic philosophy of language. Nothing could be more wrong, however, than to take Wittgenstein as a founding father of analytic philosophy. Many of the positions that refer to Wittgenstein (e.g. Kripke) are just low-quality caricatures of his work.

13. Blair’s book is a must read for any computer scientist, despite some problems in its conceptualization of information.

14. Goldman [14] provides a paradigmatic example of how psychologists constantly miss the point of philosophy, up to today. In an almost arrogant tone he claims: “First, let me clarify my treatment of justificational rules, logic, and psychology. The concept of justified or rational belief is a core item on the agenda of philosophical epistemology. It is often discussed in terms of “rules” or “principles” of justification, but these have normally been thought of as derivable from deductive and inductive logic, probability theory, or purely autonomous, armchair epistemology.”

Markie [15] demonstrated that everything in these claims is wrong or mistaken. Our point is that something like “justification” is not possible in principle, and particularly not from an empirical perspective. Goldman’s pronouncements about the foundations of his own work are utter nonsense (to this day).

15. It is one of the rare (but important) flaws in Blair’s work that he assimilates the concept of “information retrieval” in an unreflected manner. Neither is it reasonable to assign an ontological quality to information (we cannot say that information “exists”, as this would deny the primacy of interpretation), nor can we then say that information can be “retrieved”. See also our chapter about this issue. Despite his largely successful attempt to argue in favor of the importance of Wittgenstein’s philosophy for computer science, Blair fails to recognize that ontology is not tenable at large, and particularly not for issues around “information”. It is a language game, after all.

16. See the Stanford Encyclopedia of Philosophy for a discussion of the various positions.

17. In our investigation of models and their generalized form, we stressed the point that there are no apriori fixed “properties” of a measured (perceived) thing; instead we have to assign the criteria for measurement actively, hence we call these criteria assignates instead of “properties”, “features”, or “attributes”.

18. See our essay about logic.

20. See the entry in the Stanford Encyclopedia of Philosophy about Quine. Quine, in “Word and Object”, gives the following example (abridged version here). Imagine you discovered a formerly unknown tribe of friendly people. Nobody knows their language. You accompany one of them hunting. Suddenly a hare rushes along, crossing your way. The hunter immediately points to the hare, shouting “Gavagai!” What did he mean? Funnily enough, this story happened in reality. British settlers in Australia wondered about those large animals hopping around. They asked the aborigines about the animal and its name. The answer was “kangaroo” – which means “I do not understand you” in their language.

21. This, of course, resembles Bergson, who, in Matter and Memory [23], argued that any thinking and understanding takes place by means of primary image-like “representations”. As Leonard Lawlor (Henri Bergson@Stanford) summarizes, for Bergson “knowledge of things, in its pure state, takes place within the things it represents.” We would not describe our principle of associativity, as it can be realized by SOMs, very differently…

22. The main difference between “intension” and “concept” is that the former still maintains a set of indices to raw observations of external entities, while the latter is completely devoid of such indices.

23. We conceived randolations as pre-specific relations; one may also think of them as probabilistic quasi-species that eventually may become discrete upon some measurement. The intention for conceiving of randolations is given by the central drawback of relations: their double-binary nature presumes apriori measurability and identifiability, something that is not appropriate when dealing with language.

24. “Raw” is indeed very relative, especially if we take culturally transformed or culturally enabled percepts into account.

25. There are mainly two aspects to that: (1) large parts of the internet are organized as a hierarchical network, not as an associative network; nowadays everybody should know that telephone networks did not, do not and will not develop “intelligence”; (2) so-called grid computing is always organized as a linear, additive division of labor; thus it allows processes to run faster, but no qualitative change is achieved, as can be observed for instance in the purely size-related contrast between a mouse and an elephant. Taking (1) and (2) together, we may safely conclude that doing the wrong things (=counting Cantoric dust) at high speed will not produce anything capable of developing a capacity to understand.

References

  • [1] Ludwig Wittgenstein, Zettel. Oxford, Basil Blackwell, 1967. Edited by G.E.M. Anscombe and G.H. von Wright, translated by G.E.M. Anscombe.
  • [2] Edmund Gettier (1963), Is Justified True Belief Knowledge? Analysis 23: 121-123.
  • [3] Michel Foucault “Dits et Ecrits”, Vol I.
  • [4] Bernhard Waldenfels, Idiome des Denkens. Suhrkamp, Frankfurt 2005.
  • [5] Henning Schmidgen (ed.), Aesthetik und Maschinismus, Texte zu und von Felix Guattari. Merve, Berlin 1995.
  • [6] David Blair, Wittgenstein, Language and Information – Back to the Rough Ground! Springer Series on Information Science and Knowledge Management, Vol.10, New York 2006.
  • [7] Martin Heidegger, The Question Concerning Technology and Other Essays. Harper, New York 1977.
  • [8] Peter Sloterdijk, Nicht-gerettet, Versuche nach Heidegger. Suhrkamp, Frankfurt 2001.
  • [9] Hermann Haken, Synergetik. Springer, Berlin New York 1982.
  • [10] R. Graham, A. Wunderlin (eds.): Lasers and Synergetics. Springer, Berlin New York 1987.
  • [11] John Eccles, The Understanding of the Brain. 1973.
  • [12] Douglas Hofstadter, Fluid Concepts And Creative Analogies: Computer Models Of The Fundamental Mechanisms Of Thought. Basic Books, New York 1996.
  • [13] Robert van Rooij, Vagueness, Tolerance and Non-Transitive Entailment. pp.205-221 in: Petr Cintula, Christian G. Fermüller, Lluis Godo, Petr Hajek (eds.), Understanding Vagueness. Logical, Philosophical and Linguistic Perspectives. Vol.36 of Studies in Logic, College Publications, London 2011. The book is available online.
  • [14] Alvin I. Goldman (1988), On Epistemology and Cognition, a response to the review by S.W. Smoliar. Artificial Intelligence 34: 265-267.
  • [15] Peter J. Markie (1996). Goldman’s New Reliabilism. Philosophy and Phenomenological Research Vol.56, No.4, pp. 799-817
  • [16] Saul Kripke, Naming and Necessity. 1972.
  • [17] Scott Soames, Beyond Rigidity: The Unfinished Semantic Agenda of Naming and Necessity. Oxford University Press, Oxford 2002.
  • [18] Scott Soames (2006), Précis of Beyond Rigidity. Philosophical Studies 128: 645–654.
  • [19] Michel Foucault, Les Hétérotopies – [Radio Feature 1966]. Youtube.
  • [20] Michel Foucault, Die Heterotopien. Der utopische Körper. Aus dem Französischen von Michael Bischoff, Suhrkamp, Frankfurt 2005.
  • [21] David Grahame Shane, Recombinant Urbanism – Conceptual Modeling in Architecture, Urban Design and City Theory. Wiley Academy Press, Chichester 2005.
  • [22] Willard van Orman Quine, Word and Object. M.I.T. Press, Cambridge (Mass.) 1960.
  • [23] Henri Louis Bergson, Matter and Memory. transl. Nancy M. Paul & W. Scott Palmer, Martino Fine Books, Eastford (CT) 2011 [1911].
  • [24] Fernando Colmenares, Helena Rivero (1986). A Conceptual Model for Analysing Interactions in Baboons: A Preliminary Report. pp.63-80 in: Colgan PW, Zayan R (eds.), Quantitative models in ethology. Privat I.E, Toulouse.
  • [25] Irene Pepperberg (1998). Talking with Alex: Logic and speech in parrots. Scientific American. Available online. See also the Wiki entry about Alex.
  • [26] a. Robert Seyfarth, Dorothy Cheney, Peter Marler (1980). Monkey Responses to Three Different Alarm Calls: Evidence of Predator Classification and Semantic Communication. Science, Vol.210: 801-803. b. Dorothy L. Cheney, Robert M. Seyfarth (1982). How vervet monkeys perceive their grunts: Field playback experiments. Animal Behaviour 30(3): 739–751.
  • [27] Robert Seyfarth, Dorothy Cheney (1990). The assessment by vervet monkeys of their own and another species’ alarm calls. Animal Behaviour 40(4): 754–764.
  • [28] Klaus Wassermann (2010). Nodes, Streams and Symbionts: Working with the Associativity of Virtual Textures. The 6th European Meeting of the Society for Literature, Science, and the Arts, Riga, 15-19 June, 2010. available online.
  • [29] Richard Brandom, Making it Explicit. Harvard University Press, Cambridge (Mass.) 1998.
  • [30] Willard van Orman Quine (1951), Two Dogmas of Empiricism. Philosophical Review, 60: 20–43. Available online.

۞

Analogical Thinking, revisited. (II)

March 20, 2012 § Leave a comment

In this second part (II/II) of the essay about a fresh perspective on analogical thinking—more precisely: on models about it—we will try to bring two concepts together that at first sight represent quite different approaches: Copycat and SOM.

Why engage in such an endeavor? Firstly, we are quite convinced that FARG’s Copycat demonstrates an important and outstanding architecture. It provides a well-founded proposal about the way we humans apply ideas and abstract concepts to real situations. Secondly, however, it is also clear that Copycat suffers from a few serious flaws in its architecture, particularly the built-in idealism. This renders any adaptation to more realistic domains, or even to completely domain-independent conditions, very difficult, if not impossible, since this drawback also prohibits structural learning. So far, Copycat is just able to adapt some predefined internal parameters. In other words, the Copycat mechanism just adapts a predefined structure, though a quite abstract one, to a given empiric situation.

Well, basically there seem to be two different, “opposite” strategies to merge these approaches. Either we integrate the SOM into Copycat, or we try to transfer the relevant, yet to be identified parts from Copycat to a SOM-based environment. Yet, at the end of the day we will see that and how the two alternatives converge.

In order to accomplish our goal of establishing a fruitful combination of SOM and Copycat we have to take mainly three steps. First, we briefly recapitulate the basic elements of Copycat and the proper instance of a SOM-based system. Second, we will describe the extended SOM system in some detail, albeit there will be a dedicated chapter on it. Finally, we have to transfer and presumably adapt those elements of the Copycat approach that are missing in the SOM paradigm.

Crossing over

The particular power of (natural) evolutionary processes derives from the fact that they are based on symbols. “Adaptation” or “optimization” are not processes that change just the numerical values of parameters in formulas. Quite the opposite: in adaptational processes that span generations, parts of the DNA-based story are being rewritten, with potential consequences for the whole of the story. This effect of recombination in the symbolic space is particularly present in the so-called “crossing over” during the production of gamete cells in the context of sexual reproduction in eukaryotes. Crossing over is a “technique” to dramatically speed up the exploration of the space of potential changes. (In some way, this space is also greatly enlarged by symbolic recombination.)

What we will try here in our attempt to merge the two concepts of Copycat and SOM is exactly this: a symbolic recombination. The difference to its natural template is that in our case we do not transfer DNA-snippets between homologous locations in chromosomes, we transfer whole “genes,” which are represented by elements.

Elementarizations I: C.o.p.y.c.a.t.

In part 1 we identified two top-level (non-atomic) elements of Copycat.

Since the first element, covering evolutionary aspects such as randomness, population and a particular memory dynamics, is pretty clear, and a whole range of possible ways to implement it are available, any attempt at improving the Copycat approach has to target the static, strongly idealistic characteristics of the structure that is called “Slipnet” by the FARG. The Slipnet has to be enabled for structural changes and autonomous adaptation of its parameters. This could be accomplished in many ways, e.g. by representing the items in the Slipnet as primitive artificial genes. Yet, we will take a different road here, since the SOM paradigm already provides the means to achieve idealizations.

At this point we have to elementarize Copycat’s Slipnet in a way that renders it compatible with the SOM principles. Hofstadter emphasizes the following properties of the Slipnet and the items contained therein (pp. 212):

  • (1) Conceptual depth allows for a dynamic and continuous scaling of “abstractness” and resistance against “slipping” to another concept;
  • (2) Nodes and links between nodes both represent active abstract properties;
  • (3) Nodes acquire, spread and lose activation, which knows a switch-on threshold < 1;
  • (4) The length of links represents conceptual proximity or degree of association between the nodes.

As a whole, and viewed from the network perspective, the Slipnet behaves much like a spring system, or a network built from rubber bands, where the springs or the rubber bands are regulated in their strength. Note that our concept of SomFluid also exhibits the feature of local regulation of the bonds between nodes, a property that is not present in the idealized standard SOM paradigm.

Yet, the most interesting properties in the list above are (1) and (2), while (3) and (4) are known in the classic SOM paradigm as well. The first item is great because it represents an elegant instance of creating the possibility for a measurability that goes far beyond the nominal scale. As a consequence, “abstractness” ceases to be a nominal all-or-none property, as it is in hierarchies of abstraction. Such hierarchies can now be recognized as mere projections or selections, both introducing a severe limitation of expressibility. Conceptual depth opens a new space.

The second item is also very interesting since it blurs the distinction between items and their relations to some extent. That distinction is also a consequence of relying too readily on the nominal scale of description. It introduces a certain moment of self-reference, though this is not fully developed in the Slipnet. Nevertheless, a result of this move is that concepts can’t be thought without their embedding into a neighborhood of other concepts. Hofstadter clearly introduces a non-positivistic and non-idealistic notion here, as it establishes a non-totalizing meta-concept of wholeness.

Yet, the blurring between “concepts” and “relations” could be, and must be, driven far beyond the level Hofstadter achieved, if the Slipnet is to become extensible. Namely, all the parts and processes of the Slipnet need to follow the paradigm of probabilization, since this offers the only way to evade the demons of cybernetic idealism and apriori control. Hofstadter himself relies much on probabilization concerning the other two architectural parts of Copycat. It’s beyond me why he didn’t apply it to the Slipnet too.

Taken together, we may derive (or: impose) the following important elements for an abstract description of the Slipnet.

  • (1) Smooth scaling of abstractness (“conceptual depth”);
  • (2) Items and links of a network of sub-conceptual abstract properties are instances of the same category of “abstract property”;
  • (3) Activation of abstract properties represents a non-linear flow of energy;
  • (4) The distance between abstract properties represents their conceptual proximity.

A note should be added regarding the last (fourth) point. In Copycat, this proximity is a static number. In Hofstadter’s framework, it does not express something like similarity, since the abstract properties are not conceived as compounds. That is, the abstract properties are themselves on the nominal level. And indeed, it might appear as rather difficult to conceive of concepts as “right of”, “left of”, or “group” as compounds. Yet, I think that it is well possible by referring to mathematical group theory, the theory of algebra and the framework of mathematical categories. All of those may be subsumed into the same operationalization: symmetry operations. Of course, there are different ways to conceive of symmetries and to implement the respective operationalizations. We will discuss this issue in a forthcoming essay that is part of the series “The Formal and the Creative“.

The next step is now to distill the elements of the SOM paradigm in a way that enables a common differential for the SOM and for Copycat.

Elementarizations II: S.O.M.

The self-organizing map is a structure that associates comparable items—usually records of values that represent observations—according to their similarity. Hence, it makes two strong and important assumptions.

  • (1) The basic assumption of the SOM paradigm is that items can be rendered comparable;
  • (2) The items are conceived as tokens that are created by repeated measurement.

The first assumption means that the structure of the items can be described (i) apriori to their comparison and (ii) independently from the final result of the SOM process. Of course, this assumption is not unique to SOMs; any algorithmic approach to the treatment of data is committed to it. The particular status of the SOM is given by the fact—in stark contrast to almost any other method for the treatment of data—that this is the only strong assumption. All other parameters can be handled in a dynamic manner. In other words, there is no particular zone of the internal parametrization of a SOM that would be inaccessible apriori. Compare this with ANN or statistical methods, and you feel the difference… Usually, methods are rather opaque with respect to their internal parameters. For instance, the similarity functional is usually not accessible, which renders all these nice-looking, so-called analytic methods into some kind of subjective gambling. In PCA and its relatives, for instance, the similarity is buried in the covariance matrix, which in turn is only defined within the assumption of normality of correlations. If not a rank correlation is used, this assumption is extended even to the data itself. In both cases it is impossible to introduce a different notion of similarity. Also, and as a consequence of that, it is impossible to investigate the particular dependency of the results proposed by the method on the structural properties and (opaque) assumptions. In contrast to such unfavorable epistemo-mythical practices, the particular transparency of the SOM paradigm allows for critical structural learning of the SOM instances. “Critical” here means that the influence of the method’s internal parameters on the results or conclusions can be investigated, changed, and accordingly adapted.
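
As an illustration of this transparency, consider the following minimal sketch of a SOM in Python, in which the similarity functional is an explicit, exchangeable parameter instead of an opaque internal of the method. All names, defaults and decay schedules here are our own assumptions for the purpose of illustration, not a canonical implementation.

```python
import numpy as np

def train_som(data, shape=(20, 20), epochs=10,
              similarity=lambda w, x: -np.linalg.norm(w - x, axis=-1)):
    """Minimal SOM sketch; `similarity` is deliberately a plain,
    replaceable function instead of a buried assumption."""
    h, w = shape
    dim = data.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.random((h, w, dim))
    yy, xx = np.mgrid[0:h, 0:w]                    # lattice coordinates
    for epoch in range(epochs):
        lr = 0.5 * (1.0 - epoch / epochs)          # decaying learning rate
        radius = max(1.0, (h / 2.0) * (1.0 - epoch / epochs))
        for x in data:
            sim = similarity(weights, x)           # any notion of similarity
            bi, bj = np.unravel_index(np.argmax(sim), (h, w))
            d2 = (yy - bi) ** 2 + (xx - bj) ** 2   # squared lattice distance
            g = np.exp(-d2 / (2.0 * radius ** 2))  # neighborhood kernel
            weights += lr * g[..., None] * (x - weights)
    return weights
```

Swapping the default for, say, a cosine or rank-based similarity changes the notion of proximity without touching the mechanism itself, which is precisely the accessibility claimed above. Note also that no item is ever compared pairwise to all other items; the association of similar items is produced by the lattice neighborhood alone.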

The second assumption is implied by its purpose to be a learning mechanism. It simply needs some observations as results of the same type of measurement. The number of observations (the number of repeats) has to exceed a certain lower threshold, which, depending on the data and the purpose, is at least 8; typically, however, (much) more than 100 observations of the same kind are needed. Any result will be within the space delimited by the assignates (properties), and thus any result is a possibility (if we take just the SOM itself).

The particular accomplishment of a SOM process is the transition from the extensional to the intensional description, i.e. the SOM may be used as a tool to perform the step from tokens to types.

From this we may derive the following elements of the SOM (note 1):

  • (1) a multitude of items that can be described within a common structure, though not necessarily an identical one;
  • (2) a dense network where the links between nodes are probabilistic relations;
  • (3) a bottom-up mechanism which results in the transition from an extensional to an intensional level of description.

As a consequence of this structure, the SOM process avoids the necessity of comparing all items (N) to all other items (N-1). This property, together with the probabilistic neighborhoods, establishes the main difference to other clustering procedures.

It is quite important to understand that the SOM mechanism as such is not a modeling procedure. Several extensions have to be added and properly integrated, such as

  • – operationalization of the target into a target variable;
  • – validation by separate samples;
  • – feature selection, preferably by an instance of a generalized evolutionary process (though not by a genetic algorithm);
  • – detecting strong functional and/or non-linear coupling between variables;
  • – description of the dependency of the results from internal parameters by means of data experiments.

We already described the generalized architecture of modeling as well as the elements of the generalized model in previous chapters.

Yet, as we explained in part 1 of this essay, analogy making is conceptually incompatible with any kind of modeling, as long as the target of the model points to some external entity. Thus, we have to choose a non-modeling instance of a SOM as the starting point. However, clustering is also an instance of those processes that provide the transition from extensions to intensions, whether this clustering is embedded into full modeling or not. In other words, neither the classic SOM nor the modeling SOM is suitable as a candidate for a merger with Copycat.

SOM-based Abstraction

Fortunately, there is already a proposal, and even a well-known one, that indeed may be taken as such a candidate: the two-layer SOM (TL-SOM), as it has been demonstrated as an essential part of the so-called WebSom [1,2].

Actually, the description as “two-layered” is a very minimalistic, if not inappropriate, description of what is going on in the WebSom. We already discussed many aspects of its architecture here and here.

Concerning our interests here, the multi-layered arrangement itself is not a significant feature. Any system doing complicated things needs a functional compartmentalization; we have met a multi-part, multi-compartment and multi-layered structure in the case of Copycat too. Apart from that, the SOM mechanism itself remains perfectly identical across the layers.

The real interesting features of the approach realized in the TL-SOM are

  • – the preparation of the observations into probabilistic contexts;
  • – the utilization of the primary SOM as a measurement device (the actual trick).

The domain of application of the TL-SOM is the comparison and classification of texts. Texts belong to unstructured data, and the comparison of texts is exposed to the same problematics as the making of analogies: there is no apriori structure that could serve as a basis for modeling. Also, like the analogies investigated by the FARG, the text is a locational phenomenon, i.e. it takes place in a space.

Let us briefly recapitulate the dynamics in a TL-SOM. In order to create a TL-SOM, the text is first dissolved into overlapping, probabilistic contexts. Note that the locational arrangement is captured by these random contexts. No explicit apriori rules are necessary to separate patterns. The resulting collection of contexts then gets “somified”. Each node then contains similar random contexts that have been derived from various positions in different texts. Now the decisive step is taken, which consists in turning the perspective by “90 degrees”: we can use the SOM as the basis for creating a histogram for each of the texts. The nodes are interpreted as properties of the texts, i.e. each node represents a bin of the histogram. The values of the individual bins measure how frequently the text is represented by the respective random contexts. The secondary SOM then creates a clustering across these histograms, which represent the texts in an abstract manner.

This way the primary lattice of the TL-SOM is used to impose a structure on the unstructured entity “text.”
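
A compressed sketch of this “turn by 90 degrees”, assuming a primary SOM already trained on random contexts (e.g. by a routine like the train_som sketch above); the helpers random_contexts and vectorize and the context width are our illustrative assumptions, not WEBSOM’s published code.

```python
import numpy as np

def random_contexts(tokens, vectorize, width=3):
    """Dissolve a text into overlapping contexts; the locational
    arrangement of the text is captured by these windows."""
    return np.array([
        np.concatenate([vectorize(t) for t in tokens[i:i + width]])
        for i in range(len(tokens) - width + 1)
    ])

def text_histogram(weights, contexts):
    """Use the primary SOM as a measurement device: each node is a
    bin, and the text becomes a density across the lattice."""
    h, w, _ = weights.shape
    hist = np.zeros(h * w)
    for c in contexts:
        node = np.argmin(np.linalg.norm(weights - c, axis=-1))
        hist[node] += 1.0
    return hist / max(1, len(contexts))

# The histograms of all texts then serve as the input of the secondary
# SOM, which clusters the texts by these abstract descriptions.
```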

Figure 1: A schematic representation of a two-layered SOM with built-in self-referential abstraction. The input for the secondary SOM (foreground) is derived as a collection of histograms that are defined as a density across the nodes of the primary SOM (background). The input for the primary SOM are random contexts.

To put it clearly: the secondary SOM builds an intensional description of entities that results from the interaction of a SOM with a probabilistic description of the empirical observations. Quite obviously, intensions built this way about intensions are not only quite abstract, the mechanism could even be stacked. It could be described as “high-level perception” with the same justification as Hofstadter’s use of the term for Copycat. The TL-SOM turns representational intensions into abstract, structural ones.

The two aspects from above thus interact; they are elements of the TL-SOM. Despite the fact that there are still transitions from extensions to intensions, we also can see that the targeted units of the analysis, the texts, get probabilistically distributed across an area, the lattice of the primary SOM. Since the SOM maps the high-dimensional input data onto its map in a way that preserves their topological properties, it is easy to recognize that the TL-SOM creates conceptual halos as an intermediate.

So let us summarize the possibilities provided by the SOM.

  • (1) SOMs are able to create non-empiric, or better: de-empirified idealizations of intensions that are based on “quasi-empiric” input data;
  • (2) TL-SOMs can be used to create conceptual halos.

In the next section we will focus on this spatial, or better: primarily spatial, effect.

The Extended SOM

Kohonen and co-workers [1,2] proposed to build histograms that reflect the probability density of a text across the SOM. Those histograms represent the original units (e.g. texts) in a quite static manner, using a kind of summary statistics.

Yet, texts are definitely not a static phenomenon. At first sight there is at least a series, while more appropriately texts may even be described as dynamic networks of their own associative power [3]. Returning to the SOM we see that, additionally to the densities scattered across the nodes of the SOM, we also can observe a sequence of invoked nodes, according to the sequence of random contexts in the text (or the serial observations).

The not so difficult question then is: how to deal with that sequence? Obviously, it is again best conceived as a random process (though with a strong structure), and random processes are best described using Markov models, either as hidden (HMM) or as transitional models. Note that the Markov model is not a model about the raw observational data; it describes the sequence of activation events of SOM nodes.

The Markov model can be used as a further means to produce conceptual halos in the sequence domain. The differential properties of a particular sequence as compared to the Markov model then could be used as further properties to describe the observational sequence.
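
A sketch of this idea, under the assumption that the activation sequence is given as the series of (flattened) node indices produced by the primary SOM; the function names and the surprisal measure are our assumptions for illustration.

```python
import numpy as np

def transition_model(node_sequence, n_nodes):
    """First-order Markov model over the sequence of activated SOM
    nodes; note it describes node events, not the raw observations."""
    counts = np.full((n_nodes, n_nodes), 1e-6)   # tiny prior avoids zeros
    for a, b in zip(node_sequence[:-1], node_sequence[1:]):
        counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def sequence_surprisal(node_sequence, P):
    """Differential property of a particular sequence against the
    model: average negative log-likelihood, usable as a further
    assignate describing the observational sequence."""
    pairs = zip(node_sequence[:-1], node_sequence[1:])
    return -np.mean([np.log(P[a, b]) for a, b in pairs])
```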

(The full version of the extended SOM comprises targeted modeling as a further level. Yet, this targeted modeling does not refer to raw data. Instead, its input is provided completely by the primary SOM, which is based on probabilistic contexts, while the target of such modeling is just internal consistency to a context-dependent degree.)

The Transfer

Just to avoid misunderstanding: it does not make sense to try to represent Copycat completely by a SOM-based system. The particular dynamics and phenomenological behavior depend a lot on Copycat’s tripartite morphology as represented by the Coderack (agents), the Workspace and the Slipnet. We are “just” in search of a possibility to remove the deep idealism from the Slipnet in order to enable it for structural learning.

Basically, there are two possible routes. Either we re-interpret the extended SOM in a way that allows us to represent the elements of the Slipnet as properties of the SOM, or we try to replace all the items in the Slipnet by SOM lattices.

So, let us take a look at which structures we have (Copycat) or could have (SOM) on both sides.

Table 1: Comparing elements from Copycat’s Slipnet to the (possible) mechanisms in a SOM-based system.

  • (1) Smoothly scaled abstraction. Copycat: conceptual depth (dynamic parameter). Extended SOM: distance of abstract intensions in an integrated lattice of an n-layered SOM.
  • (2) Links as concepts. Copycat: structure by implementation. Extended SOM: reflected as conceptual proximity, taken as an assignate property for a higher-level SOM.
  • (3) Activation featuring non-linear switching behavior. Copycat: structure by implementation. Extended SOM: x.
  • (4) Conceptual proximity. Copycat: link length (dynamic parameter). Extended SOM: distance in the map (dynamic parameter).
  • (5) Kind of concepts. Copycat: locational, positional. Extended SOM: symmetries, any.

From this comparison it is clear that the single most challenging part of this route is the possibility for the emergence of abstract intensions in the SOM based on empirical data. From the perspective of the SOM, relations between observational items such as “left-most,” “group” or “right of”, and even such as “sameness group” or “predecessor group”, are just probabilities of a pattern. Such patterns are identified by functions or dynamic combinations thereof. Combinations of topological primitives remain mappable by analytic functions. Such concepts we could call “primitive concepts”, and we can map these to the process of data transformation and the set of assignates as potential properties (note 2). It is then the job of the SOM to assign a relevancy to the assignates.

Yet, Copycat’s Slipnet also comprises rather abstract concepts such as “opposite”. Furthermore, the most abstract concepts often act as links between more primitive concepts, or, in Hofstadter’s terms, conceptual items of lower “conceptual depth”.

My feeling here is that it is a fundamental mistake to implement concepts like “opposite” directly. What is opposite of something else is a deeply semantic concept in itself, thus strongly dependent on the domain. I think that most of the interesting concepts, i.e. the most abstract ones are domain-specific. Concepts like “opposite” could be considered as something “simple” only in case of geometric or spatial domains.

Yet, that’s not a weakness. We should use this as a design feature. Take the following rather simple case, shown in the next figure, as an example. Here we simply mapped triplets of uniformly distributed random values onto a SOM. The three values can be readily interpreted as parts of an RGB value, which renders the interpretation more intuitive. The special thing here is that the map has been a really large one: we defined approximately 700’000 nodes and fed approx. 6 million observations into it.

Figure 2: A SOM-based color map showing emergence of abstract features. Note that the topology of the map is a borderless toroid: Left and right borders touch each other (distance=0), and the same applies to the upper and lower borders.

We can observe several interesting things. The SOM didn’t come up with just any arbitrary sorting of the colors. Instead, a very particular one emerged.

First, the map is not perfectly homogeneous anymore. Very large maps tend to develop “anisotropies”, symmetry breaks if you like, simply due to the fact that the signal horizon becomes an important issue. This should not be regarded as a deficiency though. Symmetry breaks are essential for the possibility of the emergence of symbols. Second, we can see that two “color models” emerged: the RGB model around the dark spot in the lower left, and the YMC model around the bright spot in the upper right. Third, the distance between the bright, almost white spot and the dark, almost black one is maximized.

In other words, and not quite surprisingly, the conceptual distance is reflected as a geometrical distance in the SOM. As in the case of the TL-SOM, we now could use the SOM as a measurement device that transforms an unknown structure into an internal property, simply by using the locational property in the SOM as an assignate for a secondary SOM. In this way we not only can represent “opposite”, we even have a model procedure for “generalized oppositeness” at our disposal.

It is crucial to understand this step of “observing the SOM”, thereby conceiving the SOM as a filter, or more precisely as a measurement device. Of course, at this point it becomes clear that a large variety of such transposing and internal-virtual measurement devices may be thought of. Methodologically, this opens an orthogonal dimension to the representation of data, strongly resembling the concept of orthoregulation.
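
A sketch of such an internal-virtual measurement, assuming a trained map with toroidal topology as in Figure 2; the normalization and the function names are our assumptions, not a fixed procedure.

```python
import numpy as np

def toroidal_distance(p, q, shape):
    """Distance between two node positions on a borderless (toroidal)
    lattice: opposite borders touch, so distances wrap around."""
    d = np.abs(np.array(p) - np.array(q))
    d = np.minimum(d, np.array(shape) - d)
    return np.sqrt((d ** 2).sum())

def oppositeness(weights, x, y):
    """Operationalize 'generalized oppositeness' of two observations
    as their geometrical distance on the map, normalized by the
    maximal toroidal distance; the location acts as an assignate."""
    shape = weights.shape[:2]
    def pos(v):
        d = np.linalg.norm(weights - v, axis=-1)
        return np.unravel_index(np.argmin(d), shape)
    dmax = np.sqrt(((np.array(shape) / 2.0) ** 2).sum())
    return toroidal_distance(pos(x), pos(y), shape) / dmax
```

Fed as an assignate into a secondary SOM, such a value turns a relational, seemingly “semantic” concept into an ordinary measured property.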

The map shown above even allows one to create completely different color models, for instance one around yellow and another one around magenta. Our color psychology is strongly determined by the sun’s radiated spectrum, and hence it reflects a particular Lebenswelt; yet, there is no necessity about it. Some insects like bees are able to perceive ultraviolet radiation, i.e. their colors may have 4 components, yielding a completely different color psychology, while the capability to distinguish colors remains fully intact (note 3).

“Oppositeness” is just a “simple” example for an abstract concept and its operationalization using a SOM. We already mentioned the “serial” coherence of texts (and thus of general arguments) that can be operationalized as a sort of virtual movement across a SOM of a particular level of integration.

It is crucial to understand that there is no other model besides the SOM that combines the ability to learn from empirical data and the possibility for emergent abstraction.

There is yet another lesson that we can take home from the simple example above. Well, the example doesn’t remain that simple. High-level abstraction, items of considerable conceptual depth so to speak, requires rather short assignate vectors. In the process of learning qua abstraction it appears to be essential that the masses of possible assignates derived from, or imposed by, the measurement of raw data be reduced. On the one hand, empiric contexts from very different domains should be abstracted, i.e. quite literally “reduced”, into the same perspective. On the other hand, any given empiric context should be abstracted into (much) more than just one abstract perspective. The consequence is that we need a lot of SOMs, all “sufficiently” separated from each other. In other words, we need a dynamic population of Self-organizing Maps in order to represent the capability of abstraction in real life. “Dynamic population” here means that there are developmental mechanisms that result in a proliferation, almost a breeding, of new SOM instances in a seamless manner. Of course, the SOM instances themselves have to be able to grow and to differentiate, as we have described it here and here.

In a population of SOM the conceptual depth of a concept may be represented by the efforts to arrive at a particular abstract “intension.” This not only comprises the ordinary SOM lattices, but also processes like Markov models, simulations, idealizations qua SOMs, targeted modeling, transition into symbolic space, synchronous or potential activations of other SOM compartments etc. This effort may be represented finally as a “number.”

Conclusions

The structure of a multi-layered system of Self-organizing Maps, as proposed by Kohonen and co-workers, is a powerful model to represent emerging abstraction in response to empiric impressions. The Copycat model demonstrates how abstraction could be brought back to the level of application in order to become able to make analogies and to deal with “first-time exposures”.

Here we tried to outline a potential path to bring these models together. We regard this combination in the way we proposed it (or a quite similar one) as crucial for any advance in the field of machine-based episteme at large, but also for the rather confined area of machine learning. Attempts like that of Blank [4] appear to suffer seriously from categorical mis-attributions. Analogical thinking does not take place on the level of single neurons.

We didn’t discuss alternative models here (so far; a small extension is planned). The main reasons are, first, that it would be an almost endless job, and second, that Hofstadter already did it, and as a result of his investigation he dismissed all the alternative approaches (from authors like Gentner, Holyoak, Thagard). For an overview of recent models on creativity, analogical thinking, or problem solving, Runco [5] provides a good starting point. Of course, many authors point in roughly the same direction as we did here, but mostly the proposals are circular, not helpful because the problematic is just replaced by another one (e.g. the infamous and completely unusable “divergent thinking”), or can’t be implemented for other reasons. Holyoak and Thagard [6], for instance, claim that a “parallel satisfaction of the constraints of similarity, structure and purpose” is key in analogical thinking. Given our analysis, such statements are nothing but a great mess, mixing modeling, theory, vagueness and fluidity.

For instance, in cognitive psychology, and in the field of artificial intelligence as well, the hypothesis of Structural Mapping (STM) finds a lot of supporters [7]. Hofstadter discusses similar approaches in his book. The STM hypothesis is highly implausible and obviously a left-over of the symbolic approach to Artificial Intelligence, just transposed into more structural regions. The STM hypothesis not only has to be implemented as a whole, it also has to be implemented specifically for each domain. There is no emergence of that capability.

The combination of the extended SOM—interpreted as a dynamic population of growing SOM instances—with the Copycat mechanism indeed appears as a self-sustaining approach into proliferating abstraction and—quite significantly—back from it into application. It will be able to make analogies in any field already at its first encounter with it, even regarding itself, since both the extended SOM and Copycat comprise several mechanisms that may count as precursors of high-level reflexivity.

After this proposal little remains to be said on the technical level. One of those issues which remain to be discussed are the conditions for the possibility of binding internal processes to external references. Here our favorite candidate principle is multi-modality, that is, the joint and inextricable “processing” (in the sense of “getting affected”) of words, images and physical signals alike. In other words, I feel that we have come close to the fulfillment of the ariadnic question of this blog: “Where is the Limit?” …even in its multi-faceted aspects.

A lot of implementation work now has to be performed, eventually accompanied by some philosophical musings about “cognition”, or more appropriately the “epistemic condition”. I just would like to invite you to stay tuned for the software publications to come (hopefully in the near future).

Notes

1. see also the other chapters about the SOM, SOM-based modeling, and generalized modeling.

2. It is somehow interesting that in the brain of many animals we can find very small groups of neurons, if not even single neurons, that respond to primitive features such as verticality of lines, or the direction of the movement of objects in the visual field.

3. Ludwig Wittgenstein insisted all the time that we can’t know anything about the “inner” representation of “concepts”. It is thus free of any sense and meaning to claim knowledge about the inner state of oneself as well as that of others. Wilhelm Vossenkuhl introduces and explains the Wittgensteinian “grammatical” solipsism carefully and in a very nice way [8]. The only thing we can know about inner states is that we use certain labels for them, and the only meaning of emotions is that we do report them in certain ways. In other terms, the only thing that is important is the ability to distinguish one’s feelings. This, however, is easy to accomplish for SOM-based systems, as we have been demonstrating here and elsewhere in this collection of essays.

4. Don’t miss Timo Honkela’s webpage where one can find a lot of gems related to SOMs! The only puzzling issue about all the work done in Helsinki is that the people there constantly and pervasively misunderstand the SOM per se as a modeling tool. Despite their ingenuity they completely neglect the issues of data transformation, feature selection, validation and data experimentation, which all have to be integrated to achieve a model (see our discussion here); for a recent example see here, or the cited papers about the WebSom project.

  • [1] Timo Honkela, Samuel Kaski, Krista Lagus, Teuvo Kohonen (1997). WEBSOM – Self-Organizing Maps of Document Collections. Neurocomputing, 21: 101-117. (note 4)
  • [2] Krista Lagus, Samuel Kaski, Teuvo Kohonen (2004). Mining massive document collections by the WEBSOM method. Information Sciences, 163(1-3): 135-156. DOI: 10.1016/j.ins.2003.03.017
  • [3] Klaus Wassermann (2010). Nodes, Streams and Symbionts: Working with the Associativity of Virtual Textures. The 6th European Meeting of the Society for Literature, Science, and the Arts, Riga, 15-19 June, 2010. available online.
  • [4] Douglas S. Blank, Implicit Analogy-Making: A Connectionist Exploration. Indiana University Computer Science Department. Available online.
  • [5] Mark A. Runco, Creativity: Research, Development, and Practice. Elsevier 2007.
  • [6] Keith J. Holyoak and Paul Thagard, Mental Leaps: Analogy in Creative Thought.
    MIT Press, Cambridge 1995.
  • [7] John F. Sowa, Arun K. Majumdar (2003), Analogical Reasoning. In: A. Aldo, W. Lex, & B. Ganter (eds.), “Conceptual Structures for Knowledge Creation and Communication”, Proc. Intl. Conf. Conceptual Structures, Dresden, Germany, July 2003. LNAI 2746, Springer, New York 2003. pp. 16-36. Available online.
  • [8] Wilhelm Vossenkuhl. Solipsismus und Sprachkritik. Beiträge zu Wittgenstein. Parerga, Berlin 2009.


Ideas and Machinic Platonism

March 1, 2012 § Leave a comment

Once the cat had the idea to go on a journey…
You don’t believe me? Did not your cat have the same idea? Or is your doubt about my belief that cats can have ideas?

So, look at this individual here, who is climbing along the facade, outside the window…

(sorry for the spoken comment being available only in German in the clip, but I am quite sure you got the point anyway…)

Cats definitely know about the height of their own position, and this one is climbing from flat to flat … outside, on the facade of the building, on the 6th floor. Crazy, or cool, respectively, in the full meaning of these words, this cat here, since it looks like she has been having a plan… (of course, anyone who has ever lived together with a cat knows very well that they can have plans… and pride, like this one here, and also remorse…)

Yet, what would your doubts look like if I said “Once the machine got the idea…”? Probably you would stop talking or listening to me, turning away from this strange guy. Anyway, just that is the claim here, and hence I hope you keep reading.

We already discussed elsewhere (note 1) that it is quite easy to derive a bunch of hypotheses about empirical data. Yet, deriving regularities or rules from empirical data does not make up an idea, or a concept. At most they could serve as a kind of qualified precursor for the latter. Once the subject of interest has been identified, deriving hypotheses about it is almost something mechanical. Ideas and concepts as well are much more related to the invention of a problematics, as Deleuze has been working out again and again, without being that invention or problematics themselves. To overlook (or to negate?) that difference between the problematic and the question is one of the main failures of logical empiricism, and probably even of today’s science.

The Topic

But what is it then that would make up an idea, or a concept? Douglas Hofstadter once wrote [1] that we are lacking a concept of concept. Since then, a discipline has emerged that calls itself “formal concept analysis”. So, actually some people indeed do think that concepts could be analyzed formally. We will see that the issues about the relation between concepts and form are quite important. We already met some aspects of that relationship in the chapters about formalization and creativity. And we definitely think that formalization expels anything interesting from what probably had been a concept before that formalization. Of course, formalization is an important part of thinking, yet its importance is restricted to before there are concepts, or to after we have reduced them into a fixed set of finite rules.

Ideas

Ideas are almost annoying, I mean, as a philosophical concept, and they have been so since the first clear expressions of philosophy. From the very beginning there was a quarrel not only about “where they come from,” but also about their role with respect to knowledge. Very early on in philosophy two seemingly juxtaposed positions emerged, represented by the philosophical approaches of Platon and Aristotle. The former claimed that ideas are before perception, while for the latter ideas clearly have been assigned the status of something derived, secondary. Yet, recent research emphasized the possibility that the contrast between them is not as strong as it has been proposed for more than 2000 years. There is an eminent empiric pillar in Platon’s philosophical building [2].

We certainly will not delve into this discussion here; it simply would take too much space and effort, and not least there are enough sources on the web displaying the traditional positions in great detail. Throughout history since Aristotle, many and rather divergent flavors of idealism emerged. Whatever the exact distinctive claim of any of those positions is, they all share the belief in the dominance of some top-down principle as an essential part of the conditions for the possibility of knowledge, or more generally the episteme. Some philosophers like Hegel or Frege, just as others nowadays perceived as members of German Idealism, took rather radical positions. Frege’s hyper-platonism, probably the most extreme idealistic position (though not exceeding Hegel’s “great spirit” that far), indeed claimed that something like a triangle exists, and quite literally so, albeit in a non-substantial manner, completely independent from any, e.g. human, thought.

Let us fix this main property of the claim of a top-down principle as characteristic for any flavor of idealism. The decisive question then is how we could think the becoming of ideas. It is clearly one of the weaknesses of idealistic positions that they induce a salient vulnerability regarding the issue of justification. As a philosophical structure, idealism mixes content with value in the structural domain, consequently and quite directly leading to a certain kind of blind spot: political power is justified by the right idea. The factual consequences have been disastrous throughout history.

So, there are several alternatives for thinking about this becoming. But even before we consider any alternative, it should be clear that something like “becoming” and “idealism” are barely compatible. Maybe a very soft idealism, one that already turned into pragmatism, much in the vein of Charles S. Peirce, could allow one to think process and ideas together. Hegel’s position, as well as Schelling’s, Fichte’s, Marx’s or Frege’s, definitely excludes any such rapprochement or convergence.

The becoming of ideas cannot be thought as something that is flowing down from even greater transcendental heights. Of course, anybody may choose to invoke some kind of divinity here, but obviously that does not help much. A solution according to Hegel’s great spirit, history itself, is not helpful either, even though this concept implied that there is something in and about the community that is indispensable when it comes to thinking. Much later, Wittgenstein took a related route and thereby initiated the momentum towards the linguistic turn. Yet, Hegel’s history is not useful for getting clear about the becoming of ideas with regard to the involved mechanisms. And without such mechanisms anything like machine-based episteme, or cats having ideas, has to be accepted as being impossible apriori.

One such mechanism is interpretation. For us the principle of the primacy of interpretation is definitely indisputable. This does not mean that we disregard the concept of the idea; yet, we clearly take an Aristotelian position. More à jour, we could say that we are quite fond of Deleuze’s position on relating empiric impressions, affects, and thought. There are, of course, many supporters in the period of time that spans between Aristotle and Deleuze who are quite influential for our position (note 2). Yet, somehow it all culminated in the approach that has been labelled French philosophy, which for us comprises mainly Michel Serres, Gilles Deleuze and Michel Foucault, with some predecessors like Gilbert Simondon. They converged towards a position that allows one to think the embedding of ideas in the world as a process, or as an ongoing event [3,4], and this embedding is based on empiric affects.

So far, so good. Yet, we only declared the kind of raft we will build to sail with. We didn’t mention anything about how to build this raft or how to sail it. Before we can start to constructively discuss the relation between machines and ideas we first have to visit the concept, both as an issue and as a concept.

Concepts

“Concept” is a very special concept. First, it is not externalizable, which is why we call it a strongly singular term. Whenever one thinks “concept,” there is already something like a concept. For most of the other terms in our languages, such as “idea”, that does not hold. Thus, and regarding the structural dynamics of its usage, “concept” behaves similarly to “language” or “formalization.”

Additionally, however, “concept” is not a self-containing term like “language”. One needs not only symbols; one even needs a combination of categories and structured expression; there are also Peircean signs involved; and last but not least concepts relate to models, even as models are also quite apart from them. Ideas do not relate to models in the same way as concepts do.

Let us, for instance, take the concept of time. There is this abundantly cited quote by Augustine [5], a passage where he tries to explain the status of God as the creator of time, hence the fundamental incomprehensibility of God, and even of his creations (such as time) [my emphasis]:

For what is time? Who can easily and briefly explain it? Who even in thought can comprehend it, even to the pronouncing of a word concerning it? But what in speaking do we refer to more familiarly and knowingly than time? And certainly we understand when we speak of it; we understand also when we hear it spoken of by another. What, then, is time? If no one ask of me, I know; if I wish to explain to him who asks, I know not. Yet I say with confidence, that I know that if nothing passed away, there would not be past time; and if nothing were coming, there would not be future time; and if nothing were, there would not be present time.

I certainly don’t want to speculate about “time” (or God) here; instead I would like to focus on this peculiarity Augustine is talking about. Many, and probably even Augustine himself, confine this peculiarity to time (and space). I think, however, that this peculiarity applies to any concept.

By means of this example we can quite clearly experience the difference between ideas and concepts. Ideas are some kind of models—we will return to that in the next section—while concepts are both the condition for models and conditioned by models. The concept of time provides the condition for calendars, which in turn can be conceived as a possible condition for the operationalization of expectability.

“Concepts” as well as “models” do not exist as “pure” forms. We elicit a strange and eminently counter-intuitive force when trying to “think” pure concepts or models. The stronger we try, the more we imply their “opposite”, which in the case of concepts presumably is the embedding potentiality of mechanisms, and in the case of models we could say it is simply belief. We will discuss these relations in much more detail in the chapter about the choreosteme (forthcoming). Actually, we think that it is appropriate to conceive of terms like “concept” and “model” as choreostemic singular terms, or, for short, choreostemic singularities.

Even from an ontological perspective we could not claim that there “is” such a thing as a “concept”. Well, you may already know that we refute any ontological approach anyway. Yet, in the case of choreostemic singular terms like “concept” we can’t simply resort to our beloved language game. With respect to language, the choreosteme takes the role of an apriori, something like the sum of all conditions.

Since we would need a full discussion of the choreosteme, we can’t fully discuss the concept of “concept” here. Yet, as a kind of summary we may propose that the important point about a concept is that it is nothing that could exist. It does not exist as matter, as information, as substance, nor as form.

The language game of “concept” simply points into the direction of that non-existence. Concepts are not a “thing” that we could analyze, and also nothing that we could relate to by means of an identifiable relation (as e.g. in a graph). Concepts are best taken as a gradient field in a choreostemic space, yet one exhibiting a quite unusual structure and topology. So far, we identified two (of a total of four) singularities that together spawn the choreostemic space. We also could say that the language game of “concept” is used to indicate a certain form of drift in the choreostemic space. (Later we will also discuss the topology of that space, among many other issues.)

For our concerns here in this chapter, the machine-based episteme, we can conclude that it would be a misguided approach to try to implement concepts (or their formal analysis). The issue of the conditions for the ability to move around in the choreostemic space we have to postpone. In other words, we have confined our task, or at least found a suitable entry point for it: the investigation of the relation between machines and ideas.

Machines and Ideas

When talking about machines and ideas we are, here and for the time being, not interested in the usage of machines to support “having” ideas. We are not interested in such tooling for now. The question is about the mechanism inside the machine that would lead to the emergence of ideas.

Think about the idea of a triangle. Certainly, triangles as we imagine them do not belong to the material world. Any possible factual representation is imperfect when compared with the idea. Yet, without the idea (of the triangle) we wouldn’t be able to proceed, for instance, towards land survey. As already said, ideas serve as models; they do not require formalization, yet they often live as a formalization (though not always a mathematical one) in the sense of an idealized model; in other words, they serve as ladder spokes for actions. Concepts, if we contrast them with ideas, that is, if we try to distinguish them, never could be formalized; they remain inaccessible as condition. Nothing else could be expected from a transcendental singularity.

Back to our triangle. Although we can’t represent triangles perfectly, seeing a lot of imperfect triangles gives rise to the idea of the triangle. Rephrased in this way, we may recognize that the first half of the task is to look for a process that would provide an idealization (of a model), starting from empirical impressions. The second half of the task is to get the idea working as a kind of template, yet not as a literal template. Such an abstract pattern is detached from any direct empirical relation, despite the fact that we once started with empiric data.

Table 1: The two tasks in realizing “machinic idealism”

Task 1: process of idealization that starts with an intensional description
Task 2: applying the idealization for first-of-a-kind-encounters

Here we should note that culture is almost defined by the fact that it provides such ideas before any individual person has the possibility to collect enough experience for deriving them on her own.

In order to approach these tasks, we first need model systems that exhibit the desired behavior, but which are also simple enough to comprehend. Let us first deal with the first half of the task.

Task 1: The Process of Idealization

We already mentioned that we need to start from empirical impressions. These can be provided by the Self-organizing Map (SOM), as it is able to abstract from the list of observations (the extensions), thereby building an intensional representation of the data. In other words, the SOM is able to create “representative” classes. Of course, these representations depend on some parameters, but that’s not the important point here.
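To make the notion of an intensional representation a bit more concrete, here is a minimal sketch in Python (assuming numpy; the function name and the flat node indexing are illustrative assumptions, not our actual implementation). It reflects what is described below for figure 1: the intensional profile of a node is the field-wise average across all records—the extension—collected by that node.

import numpy as np

def node_intensions(records, assignments, n_nodes):
    """Field-wise average over each node's extension.

    records:     (n, d) array of observations
    assignments: (n,) array holding the node index each record is mapped to
    """
    intensions = np.full((n_nodes, records.shape[1]), np.nan)
    for node in range(n_nodes):
        members = records[assignments == node]   # the node's extension
        if len(members):                         # empty nodes stay undefined
            intensions[node] = members.mean(axis=0)
    return intensions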

Once we have those intensions available, we may ask how to proceed in order to arrive at something that we could call an idea. Our proposal for an appropriate model system consists of the following parts:

  • (1) A small set (n=4) of profiles, each consisting of 3 properties; the form of the profiles is set apriori such that they overlap partially;
  • (2) a small SOM, here with 12×12=144 nodes; the SOM needs to be trainable and should also provide a classification service, i.e. act as a model;
  • (3) a simple Monte-Carlo simulation device that is able to create randomly varied profiles which deviate from the original ones without departing too much;
  • (4) a measurement process that records the (simulated) data flow.

The profiles are defined as shown in the following table (V denotes variables, C denotes categories, or classes):

      V1    V2    V3
C1    0.1   0.4   0.6
C2    0.8   0.4   0.6
C3    0.3   0.1   0.4
C4    0.2   0.2   0.8

From these parts we then build a cyclic process, which comprises the following steps.

  • (0) Organize some empirical measurement for training the SOM; in our model system, however, we use the original profiles to create an artificial body of “original” data, in order to be able to detect the relevant phenomenon (we have perfect knowledge about the measurement);
  • (1) train the SOM;
  • (2) check the intensional descriptions for their implied risk (which should be minimal, i.e. below some threshold) and extract them as profiles;
  • (3) use these profiles to create a bunch of simulated (artificial) data;
  • (4) use these simulated records to train the SOM anew, i.e. return to step (1).

Thus, we have two counteracting forces: (1) a dispersion due to the randomizing simulation, and (2) the focusing of the SOM due to the filtering along separability, in our case operationalized as risk (1/ppv, ppv = positive predictive value) per node. Note that the SOM process is not a directly re-entrant process as, for instance, in Elman networks [6,7,8].4
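The following is a minimal, self-contained sketch of such a cycle in Python (numpy only). The naive SOM implementation, the jitter parameter, and the omission of the risk/ppv filtering step are simplifying assumptions of this sketch, not features of the actual system:

import numpy as np

rng = np.random.default_rng(0)

# the four apriori profiles over three variables (see the table above)
profiles = np.array([
    [0.1, 0.4, 0.6],
    [0.8, 0.4, 0.6],
    [0.3, 0.1, 0.4],
    [0.2, 0.2, 0.8],
])

def simulate(base, n_per_profile=200, jitter=0.05):
    """Monte Carlo device: vary the profiles randomly, but not too much."""
    reps = np.repeat(base, n_per_profile, axis=0)
    return np.clip(reps + rng.normal(0.0, jitter, reps.shape), 0.0, 1.0)

def train_som(data, side=12, epochs=10, lr0=0.5, sigma0=4.0):
    """A naive side-by-side SOM; returns the node profiles (intensions)."""
    w = rng.random((side, side, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(side), np.arange(side),
                                indexing="ij"), axis=-1)
    t, t_max = 0, epochs * len(data)
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = t / t_max
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            d = ((w - x) ** 2).sum(axis=-1)              # distance to all nodes
            bmu = np.unravel_index(d.argmin(), d.shape)  # best matching unit
            h = np.exp(-((grid - bmu) ** 2).sum(axis=-1) / (2 * sigma ** 2))
            w += lr * h[..., None] * (x - w)             # neighborhood update
            t += 1
    return w

# step (0): an artificial body of "original" data from the apriori profiles
data = simulate(profiles)
for cycle in range(9):                      # cf. the nine panels of figure 1
    w = train_som(data)                     # step (1)
    # step (2): extract the intensional profiles; the risk (1/ppv)
    # filtering of the text is omitted here, all nodes are kept
    intensions = w.reshape(-1, 3)
    # steps (3)+(4): re-simulate from the extracted intensions, train anew
    data = simulate(intensions, n_per_profile=6)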

This process leads not only to a focusing contrast-enhancement but also to (a limited version of) inventing new intensional descriptions that have never been present in the empiric measurement, at least not saliently enough to show up as an intension.

The following figure 1a-1i shows 9 snapshots from the evolution of such a system. It starts at the top-left of the portfolio, then proceeds row-wise from left to right down to the bottom-right item. Each of the 9 items displays a SOM, where the RGB color corresponds to the three variables V1, V2, V3. A particular color thus represents a particular profile on the level of the intension. Remember that the intensions are built from the field-wise average across all the extensions collected by a particular node.

Well, let us now contemplate a bit about the sequence of these panels, which represents the evolution of the system. The first point is that there is no particular locational stability. Of course not, I am tempted to say, since a SOM is not an image that represents as an image. A SOM contains intensions and abstractions; the only issue that counts is its predictive power.

Now, comparing the colors between the first and the second panel, we see that the green (top-right in 1a, middle-left in 1b) and the brownish region (top-left in 1a, middle-right in 1b) appear much clearer in 1b as compared to 1a. In 1a, the green obviously was “contaminated” by blue, and actually by all other values as well, leading to its brightness. This tendency prevails. In 1c and 1d yellowish colors are separated, etc.

Figure 1a thru 1i: A simple SOM in a re-entrant Markov process develops idealization. Time index proceeds from top-left to bottom-right.

The point now is that the intensions contained in the last SOM (1i, bottom-right of the portfolio) were not recognizable in the beginning; in some important respect they have not been present at all. Our SOM steadily drifted away from its empirical roots. That’s not a big surprise, indeed, for we used a randomization process. The nice thing is something different: the intensions get “purified”, thereby changing their status from “intensions” to “ideas”.

Now imagine that the variables V1..Vn represent properties of geometric primitives. Our sensory apparatus is able to perceive and to encode them: horizontal lines, vertical lines, crossings, etc. In empiric data our visual apparatus may find any combination of those properties, especially in the case of a (platonic) school (say: the Academy), where the pupils and the teachers draw triangles over triangles into the wax tablets, or into the sand of the pathways in the garden…

By now, the message should be quite clear: there is nothing special about ideas. In abstract terms, what is needed is

  • (1) a SOM-like structure;
  • (2) a self-directed simulation process;
  • (3) re-entrant modeling

Notice that we need not specify a target variable. The associative process itself is sufficient.

Given this model it should no longer surprise us why the first philosophers came up with idealism. It is almost built into the nature of the brain. We may summarize our achievements in the following characterization:

Ideas can be conceived as idealizations of intensional descriptions.

It is of course important to be aware of the status of such a “definition”. First, we tried to separate concepts and ideas. Most of the literature about ideas conflates them. Yet, as long as they are conflated, everything and any reasoning about mental affairs, cognition, thinking and knowledge necessarily remains inappropriate. For instance, the infamous discourse about universals and qualia seriously suffered from that conflation; or, more precisely, those debates only arose due to that mess.

Second, our lemma is just an operationalization, despite the fact that we are quite convinced about its reasonability. Yet, there might be different ones.

Our proposal has important benefits though, as it matches a lot of the aspects commonly associated with the term “idea.” In my opinion, what is especially striking about the proposed model is the observation that idealization implicitly also led to the “invention” of “intensions” that were not present in the empiric data. Who would have expected that idealization is implicitly inventive?

Finally, two small notes should be added, concerning the type of data and the status of the “idea” as a continuously intermediate result of the re-entrant SOM process. One should be aware that the “normal” input to natural associative systems are time series. Our brain is dealing with a manifold of series of events, which is mapped onto the internal processes, that is, onto another time-based structure. Prima facie, our brain is not dealing with tables. Yet, (virtual) tabular structures are implied by the process of propertization, which is an inevitable component of any kind of modeling. It is well known that it is time-series data and their modeling that give rise to the impression of causality. In the light of ideas qua re-entrant associativity, we now can easily understand the transition from networks of potential causal influences to the claim of “causality” as some kind of pure concept. Although the idea of causality (in the Newtonian sense) played an important role in the history of science, it is just that: a naive idealization.

The other note concerns the source of the data. If we consider re-entrant informational structures that are arranged across large “distances”, possibly with several intermediate transformative complexes (for which there are hints from neurobiology), we may understand that for a particular SOM (or SOM-like structure) the type of the source is completely opaque. To put it short, it does not matter for our proposed mechanism whether the data are sourced as empiric data from the external world, or as some kind of simulated, surrogated re-entrant data from within the system itself. In such wide-area, informationally re-entrant probabilistic networks we may expect a kind of runaway idealization. The question then is about the minimal size necessary for eliciting that effect. A nice corollary of this result is the insight that logistic networks, such as the internet or the telephone cabling, will NEVER start to think by themselves, as some still expect. Yet, since there are a lot of brains embedded as intermediate transforming entities in this deterministic cablework, we indeed may expect that the whole assembly is much more than could be achieved by a small group of humans living, say, around 1983. But that is not really a surprise.

Task 2: Ideas, applied

Ideas are an extremely important structural phenomenon, because they allow us to recognize things and to deal with tasks that we have never seen before. We may act adaptively before having encountered a situation that would directly resemble—as an equivalence class—any intensional description available so far.

Actually, it is not just one idea; it is a “system” of ideas that is needed for that. Some years ago, Douglas Hofstadter and his group3 devised a model system suitable for demonstrating exactly this: the application of ideas. They called the project (and the model system) Copycat.

We won’t discuss Copycat and how its analogy-making is ruled by top-down ideas here (we already introduced it elsewhere). We just want to note that the central “platonic” concept in Copycat is a dynamic relational system of symmetry relations. Such symmetry relations are, for instance, “before”, “after”, “builds a group”, “is a triple”, etc. These relations represent different levels of abstraction, but that’s not important here. Much more important is the fact that the relations between these symmetry relations are dynamic and will adapt according to the situation at hand.

I think that these symmetry relations as conceived by the Fargonauts are on the same level as our ideas. The transition from ideas to symmetries is just a grammatological move.

The case of Biological Neural Systems

Re-entrance seems to be an important property of natural neural networks. Very early in the liaison of neurobiology and computer science, starting with Hebb in the late 1940ies and, much later, with Hopfield, recurrent networks have been attractive for researchers. Take a look at drawings like the following, created (!) by Ramon y Cajal [10] at the beginning of the 20th century.

Figure 2a-2c: Drawings by Ramon y Cajal, the Spanish neurobiologist. See also: History of Neuroscience. a: from a sparrow’s brain; b: motor area of the human brain; c: hypothalamus in the human brain.

Yet, Hebb, Hopfield and Elman got trapped by the (necessary) idealization of Cajal’s drawings. Cajal’s interest was to establish and to prove the “neuron hypothesis”, i.e. that brains work on the basis of neurons. From Cajal’s drawings to the claim that biological neuronal structures could be represented by cybernetic systems or finite state machines is, honestly, a breakneck leap, or, likewise, ideology.

Figure 3: Structure of an Elman Network; obviously, Elman was seriously affected by idealization (click for higher resolution).

Thus, we propose to distinguish between re-entrant and recurrent networks. While the latter are directly wired onto themselves in a deterministic manner, that is, the self-reference is modeled on the morphological level, the former are modeled on the informational level. Since it is simply impossible for a cybernetic structure to reflect neuromorphological plasticity and change, the informational approach is much more appropriate for modeling large assemblies of individual “neuronal” items (cf. [11]).

Nevertheless, the principle of re-entrance remains a very important one. It is a structure that is known to lead to contrast enhancement and to second-order memory effects. It is also a cornerstone in the theory (or theories) proposed by Gerald Edelman, who probably is much less affected by cybernetics (e.g. [12]) than the authors cited above. Edelman has always conceived the brain-mind as something like an abstract informational population; he even was the first to adopt evolutionary selection processes (Darwinian and others) to describe the dynamics in the brain-mind.

Conclusion: Machines and Choreostemic Drift

Our point of departure was to distinguish between ideas and concepts. Their difference becomes visible if we compare them, for instance, with regard to their relation to (abstract) models. It turns out that ideas can be conceived as a more or less stable immaterial entity (though not a “state”) of self-referential processes involving self-organizing maps and the simulated surrogates of intensional descriptions. Concepts, on the other hand, are described as a transcendental vector in choreostemic processes. Consequently, we may propose only for ideas that we can implement their conditions and mechanisms, while concepts can’t be implemented. It is beyond the expressibility of any technique to talk about the conditions for their actualization. Hence, the issue of “concept” has been postponed to a forthcoming chapter.

Ideas can be conceived as the effect of putting a SOM into a re-entrant context, through which the SOM develops a system of categories beyond simple intensions. These categories are not justified by empirical references any more, at least not in the strong sense. Hence, ideas can also be characterized as being clearly distinct from models or schemata. Both models and schemata involve classification, which—due to the dissolved bonds to empiric data—can not be regarded as a sufficient component for ideas. We would like to suggest the described mechanism as the candidate principle for the development of ideas. We think that the simulated data in the re-entrant SOM process should be distinguished from data in contexts that are characterized by measurement of “external” objects, albeit their digestion by the SOM mechanism itself remains the same.

From what has been said it is also clear that the capability of deriving ideas alone still remains quite close to the material arrangements of a body, whether thought of as biological wetware or as software. Therefore, we still haven’t reached a state where we can talk about epistemic affairs. What we need is the possibility of expressing the abstract conditions of the episteme.

Of course, what we have compiled here exceeds by far any other approach, and additionally we think that it could serve as a natural complement to the work of Douglas Hofstadter. In his work, Hofstadter had to implement the platonic heavens of his machine manually, and even for the small domain he’d chosen this was tedious work. Here we proposed the possibility of a seamless transition from the world of associative mechanisms like the SOM to the world of platonic Copycats, where “seamless” refers to “implementable”.

Yet, what is really interesting is the form of choreostemic movement or drift, resulting from a particular configuration of the dynamics in systems of ideas. But this is another story, perhaps related to Felix Guattari’s principle of the “machinic”, and it definitely can’t be implemented any more.

Notes

1. we did so in the recent chapter about data and their transformation, but also see the section “Overall Organization” in Technical Aspects of Modeling.

2. You really should be aware that this trace we try to put forward here does not come close to even a coarse outline of all of the relevant issues.

3. they called themselves the “Fargonauts”, from FARG being the acronym for “Fluid Analogy Research Group”.

4. Elman networks are an attempt to simulate neuronal networks on the level of neurons. We rate such approaches as fundamentally misguided, deeply inspired by cybernetics [9], because they consider noise as disturbance. Actually, they are equivalent to finite state machines. It is somewhat ridiculous to consider a finite state machine as a model for learning “networks”. SOMs, in contrast, especially if used in architectures like ours, are fundamentally probabilistic structures that could be regarded as “feeding on noise.” Elman networks, and their predecessor, the Hopfield network, are not quite useful, due to problems in scalability and, more importantly, in stability.

  • [1] Douglas R. Hofstadter, Fluid Concepts And Creative Analogies: Computer Models Of The Fundamental Mechanisms Of Thought. Basic Books, New York 1996. p.365
  • [2] Gernot Böhme, “Platon der Empiriker.” in: Gernot Böhme, Dieter Mersch, Gregor Schiemann (eds.), Platon im nachmetaphysischen Zeitalter. Wissenschaftliche Buchgesellschaft, Darmstadt 2006.
  • [3] Marc Rölli (ed.), Ereignis auf Französisch: Von Bergson bis Deleuze. Fin, Frankfurt 2004.
  • [4] Gilles Deleuze, Difference and Repetition. 1968.
  • [5] Augustine, Confessions, Book 11 CHAP. XIV.
  • [6] Mandic, D. & Chambers, J. (2001). Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley.
  • [7] J.L. Elman, (1990). Finding Structure in Time. Cognitive Science 14 (2): 179–211.
  • [8] Raul Rojas, Neural Networks: A Systematic Introduction. Springer, Berlin 1996. (@google books)
  • [9] Holk Cruse, Neural Networks As Cybernetic Systems: Science Briefings, 3rd edition. Thieme, Stuttgart 2007.
  • [10] Santiago Ramón y Cajal, Texture of the Nervous System of Man and the Vertebrates, Volume I. Springer, Wien 1999, edited and translated by Pedro Pasik & Tauba Pasik. see google books
  • [11] Florence Levy, Peter R. Krebs (2006), Cortical-Subcortical Re-Entrant Circuits and Recurrent Behaviour. Aust N Z J Psychiatry 40(9): 752-758. doi: 10.1080/j.1440-1614.2006.01879
  • [12] Gerald Edelman: “From Brain Dynamics to Consciousness: A Prelude to the Future of Brain-Based Devices“, Video, IBM Lecture on Cognitive Computing, June 2006.

۞

SOM = Speedily Organizing Map

February 12, 2012 § Leave a comment

The Self-organizing Map is a powerful and high-potential computational procedure.

Yet, there is no free lunch, especially not for procedures that are able to deliver meaningful results.

The self-organizing map is such a valuable procedure; we have discussed its theoretical potential with regard to a range of different aspects in other chapters. Here, we do not want to deal further with such theoretical or even philosophical issues, e.g. related to the philosophy of mind; instead we focus on the issue of performance, understood simply in terms of speed.

For all those demo SOMs, algorithmic time complexity is not really an issue. The algorithm converges rather quickly to a stable state. Yet, small maps—where “small” means something like less than 500 nodes or so—are not really interesting. It is much like in brains. Brains are made largely from neurons and some chemicals, and a lot of them can do amazing things. If you take 500 of them you may stuff a worm in an appropriate manner, but not even a termite. The important questions thus are, beyond the nice story about theoretical benefits:

What happens with the SOM principle if we connect 1’000’000 nodes?

How to organize 100, 1000 or 10’000 of such million-nodes SOMs?

By these figures we would end up with somewhere around 1..10 billion nodes1, all organized along the same principles. Just to avoid a common misunderstanding here: these masses of neurons are organized in a very similar manner, yet the totality of them builds a complex system as we have described it in our chapter about complexity. There are several, if not many, emergent levels, and a lot of self-referentiality. These 1 billion nodes are not all engaged with segmenting external data! We will see elsewhere, in the chapter about associative storage and memory, how such a deeply integrated modular system could be conceived of. There are some steps to take, though not terribly complicated or difficult ones.

When approaching such scales, the advantage of self-organization turns palpably into a problematic disadvantage. “Self-organizing” means “bottom-up,” and this bottom-up direction in SOMs is represented by the fact that all records representing the observations have repeatedly to be compared to all nodes in the SOM in order to find the so-called “best matching unit” (BMU). The BMU is just that node in the network whose intensional profile is the most similar among all profiles2. Though the SOM avoids comparing all records to all records, its algorithmic complexity scales as a power function of its own size! Normally, the cost of an algorithm depends on the size of the data, not on the size of the algorithm’s own structure.

In its naive form the SOM shows a complexity of something like O(n · w · m²), where n is the amount of data (number of records times the size of the feature set), w the number of nodes to be visited when searching the BMU, and m² the number of nodes affected by the update procedure. w and m are scaled by factors f1, f2 < 1 relative to the map size, but the basic complexity remains. The update procedure affects an area that depends on the size of the SOM, hence the exponent. The exact degree of algorithmic complexity is not absolutely determined, since it depends on the dynamics of the learning function, among other things.

The situation worsens significantly if we apply improvements to the original flavor of the SOM, e.g.

  • – the principle of homogenized variance (per variable across extensional containers);
  • – in targeted modeling, tracking the explicit classification performance per node on the level of records, which means that the data have to be referenced;
  • – size balancing of nodes;
  • – morphological differentiation like growing and melting, as in the FluidSOM, which additionally allows for free-ranging nodes;
  • – evolutionary feature selection and the creation of proto-hypotheses;
  • – dynamic transformation of data;
  • – and, finally, the problem of the indeterminacy of empiric data, which enforces differential modeling, i.e. a style of modeling that performs an experimental investigation of the dependency of the results on the settings (the free parameters) of the computational procedure: sampling the data, choosing the target, selecting a resolution for the scale of the resulting classification, choosing a risk attitude, among several more.

All of this affects the results of modeling, that is, the prognostic/diagnostic/semantic conclusions one could draw from it. Although all these steps could be organized based on a set of rules, including applying another instance of a SOM, and thus could be run automatically, all of these necessary explorations require separate modeling. It is quite easy to set up an exploration plan for differential modeling that would require several dozens of models, and if evolutionary optimization is going to be applied, hundreds if not thousands of different maps have to be calculated.

Fortunately, the SOM offers a range of opportunities for using dynamic look-up tables and parallel processing. A SOM consisting of 1’000’000 neurons could easily utilize several thousand threads, without many worries about concurrency (or collisions of parallel threads). Unfortunately, such computers are not available yet, but you get the message…

Meanwhile we have to apply optimization through dynamically built look-up tables. These I will describe briefly in the following sections.

Searching the Most Similar Node

An integral part of speeding up the SOM in real-life applications is an appropriate initialization of the intensional profiles across the map. Of course, precisely this can not be known in advance, at least not exactly; the self-organization of the map is the shortest path to its final state, and there is no analytic short-cut. Kohonen proposes to apply Principal Component Analysis (PCA) for calculating the initial values. I am convinced that this is not a good idea. The PCA is deeply incompatible with the SOM, hence it will respond to very different traits in the data. PCA and SOM behave similarly only in demo cases…

Preselecting the Landing Zone

A better alternative is the SOM itself. Since the mapping organized by the SOM preserves the topology of the data, we can apply a much smaller SOM, or even a nested series of down-scaled SOMs, to create a coarse model for selecting the appropriate sub-population in the large SOM. The steps are the following:

  • 1. create the main SOM, say 40’000 nodes, organized on a square grid with sides of 200 nodes;
  • 2. create a SOM for preselecting the landing zone, scaled to approximately 14 by 14 nodes, using the same structure (i.e. the same feature vector) as for the large SOM;
  • 3. prime the small SOM with a small but significant sample of the data, say 2000..4000 records for its roughly 200 nodes, drawn randomly from the data; this step completes comparatively quickly (faster by a factor of about 200 in our example);
  • 4. initialize the main SOM by a blurred (geometric) projection of the intensional profiles from the minor to the larger SOM;
  • 5. now use the minor SOM as a model for the selection of the landing zone, simply by means of geometric projection.

As a result, the number of nodes to be visited in the large SOM in order to find the best match remains almost constant; the following sketch illustrates the procedure.
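A sketch of this search in Python/numpy, assuming both maps are plain arrays of intensional profiles (the function names, the window radius, and the toy data are illustrative assumptions):

import numpy as np

def find_bmu(som, x):
    """Return grid coordinates of the best matching unit."""
    d = ((som - x) ** 2).sum(axis=-1)
    return np.unravel_index(d.argmin(), d.shape)

def bmu_with_landing_zone(small_som, large_som, x, radius=8):
    """Scan the small SOM first, then only a window of the large SOM."""
    si, sj = find_bmu(small_som, x)                  # step 5: coarse BMU
    scale_i = large_som.shape[0] / small_som.shape[0]
    scale_j = large_som.shape[1] / small_som.shape[1]
    ci, cj = int(si * scale_i), int(sj * scale_j)    # geometric projection
    i0, i1 = max(ci - radius, 0), min(ci + radius + 1, large_som.shape[0])
    j0, j1 = max(cj - radius, 0), min(cj + radius + 1, large_som.shape[1])
    wi, wj = find_bmu(large_som[i0:i1, j0:j1], x)    # fine search in window
    return i0 + wi, j0 + wj

# toy usage: roughly (2*radius+1)^2 = 289 nodes scanned instead of 40'000
small = np.random.rand(14, 14, 10)
large = np.random.rand(200, 200, 10)
print(bmu_with_landing_zone(small, large, np.random.rand(10)))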
There is an interesting correlate to this technique. If one needs a series of SOM-based representations of the data, distinguished just by the maximum number of nodes in the map, one should always start with the lowest, i.e. most coarse resolution, with the least number of nodes. The results then can be used as a projective priming of the SOM on the next level of resolution.

Explicit Lookup Tables linking Observations to SOM areas

In the previous section we described the usage of a secondary, much smaller SOM as a device for limiting the number of nodes to be scanned. The same problem can be addressed by explicit lookup tables that establish a link between a given record and a vicinity around its last (few) best matching units.

If the SOM is approximately stable, that is, after the SOM has seen a significant portion of the data, it is no longer necessary to check the whole map. Just scan the vicinity around the last best matching node in the map. Again, the number of nodes to be checked is almost constant.
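A minimal sketch of such a lookup table (plain Python/numpy again; the dictionary, the record key, and the radius are illustrative assumptions):

import numpy as np

last_bmu = {}  # record id -> last best matching unit (illustrative cache)

def bmu_cached(som, rec_id, x, radius=3):
    """Full scan on first encounter, afterwards only the vicinity."""
    if rec_id in last_bmu:
        ci, cj = last_bmu[rec_id]
        i0, j0 = max(ci - radius, 0), max(cj - radius, 0)
        window = som[i0:ci + radius + 1, j0:cj + radius + 1]
        d = ((window - x) ** 2).sum(axis=-1)
        wi, wj = np.unravel_index(d.argmin(), d.shape)
        bmu = (i0 + wi, j0 + wj)
    else:
        d = ((som - x) ** 2).sum(axis=-1)
        bmu = np.unravel_index(d.argmin(), d.shape)
    last_bmu[rec_id] = bmu
    return bmu

With a radius of 3, only 49 nodes are scanned per record instead of the whole map.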

The stability can not be predicted in advance, of course. The SOM is, as the name says, self-organizing (albeit in a weak manner). As a rule of thumb, one could check the average number of observations attached to a particular node, the average being taken across all nodes that contain at least one record. This average filling should be larger than 8..10 (due to considerations based on variance, and some arguments derived from non-parametric statistics… but it is a rule of thumb).

Large Feature Vectors

Feature vectors can be very large. In life sciences and medicine I have experienced cases with 3000 raw variables. During data transformation this number can increase to 4000..6000 variables. The comparison of such feature vectors is quite expensive.

Fortunately, there are some nice tricks, which are all based on the same strategy. This strategy comprises the following steps.

  • 1. create a temporary SOM with a very different feature vector; this vector has just around 80..100 (real-valued) positions plus 1 position for the index variable (in other words, the table key); thus the size of the vector is a 60th of the original vector, if we are faced with 6000 variables;
  • 2. create the derived vectors by encoding the records representing the observations with a technique called “random projection”; such a projection is generated by multiplying the data vector with a token from a catalog of (labeled) matrices that are filled with uniform random numbers ([0..1]);
  • 3. create the “random projection” SOM based on these transformed records;
  • 4. after training, replace the random projection data with real data, re-calculate the intensional profiles accordingly, and run a small sample of the data through the SOM for final tuning.

The technique of random projection was invented in 1988. The principle works because of two facts:

  • (1) Quite amazingly, all random vectors beyond a certain dimensionality (80..200, as said before) are nearly orthogonal to each other. The random projection compresses the original data without losing the bits of information that are distinctive, even if they are not accessible in an explicit manner any more.
  • (2) The only trait of the data that is considered by the SOM is their potential difference.
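A sketch of the projection step follows. Note one deviation from the description above: instead of uniform random numbers in [0..1], the sketch uses the more common zero-mean Gaussian entries, for which the near-orthogonality (and the distance preservation known from the Johnson-Lindenstrauss lemma) is easy to verify; all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n_records, d_orig, d_proj = 1000, 6000, 100

X = rng.random((n_records, d_orig))                 # original feature vectors
# projection matrix with zero-mean entries, scaled to preserve norms
R = rng.normal(0.0, 1.0 / np.sqrt(d_proj), (d_orig, d_proj))
X_proj = X @ R                                      # compressed records

# pairwise distances are approximately preserved
i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(X_proj[i] - X_proj[j]))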

Bags of SOMs

Finally, one can take advantage of splitting the data into several smaller samples. These samples require only smaller SOMs, which run much faster (we are faced with a power law). After training each of the SOMs, they can be combined into a compound model.

This technique is known as bagging in data mining. Today it is also quite popular in the form of so-called random forests, where instead of one large decision tree many smaller ones are built and then combined. This technique is very promising, since it is a technique of nature. It is simply modularization on an abstract level, leading to the next level of abstraction in a seamless manner. It is also one of our favorite principles for the overall “epistemic machine”. A sketch of the combination step follows below.
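A hedged sketch of that combination step, assuming each small SOM has been trained separately on its own sample and its nodes carry majority labels; the combination rule here is a plain majority vote, and all names are illustrative:

import numpy as np
from collections import Counter

def find_bmu(som, x):
    """Best matching unit in a (rows, cols, d) SOM."""
    d = ((som - x) ** 2).sum(axis=-1)
    return np.unravel_index(d.argmin(), d.shape)

def bagged_predict(som_bag, x):
    """som_bag: list of (weights, labels) pairs, one small SOM per sample;
    labels maps node coordinates to the node's majority class."""
    votes = [labels[find_bmu(som, x)] for som, labels in som_bag]
    return Counter(votes).most_common(1)[0][0]

# toy usage: three 10x10 SOMs over 5 variables with random 0/1 node labels
bag = [(np.random.rand(10, 10, 5), np.random.randint(0, 2, (10, 10)))
       for _ in range(3)]
print(bagged_predict(bag, np.random.rand(5)))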

Notes

1. This would represent just around 10% of the neurons of our brain, if we interpret each node as a neuron. Yet, this comparison is misleading. The functionality of a node in a SOM rather represents a whole population of neurons, although there is no 1:1 principle transferable between them. Hence, such a system would roughly be of the size of a human brain, and, much more important, it would likely be organized in a comparable, albeit rather alien, manner.

2. Quite often, the vectors that are attached to the nodes are called weight vectors. This is a serious misnomer, as neither the nodes are weighted by this vector (alone), nor the variables that make up that vector (for more details see here). Conceptually it is much more sound to call those vectors “intensional profiles.” Actually, one could indeed imagine a weight vector that would control the contribution (“weight”) of variables to the similarity / distance between two of such vectors. Such weight vectors could even be specific for each of the nodes.



۞
