Transformation
May 17, 2012
In the late 1980s there was a funny, or strange, if you like, discussion in the German public about a particular influence of the English language on the German language. That discussion did not only get teachers in higher education going; even „Der Spiegel“, Germany’s (still) leading weekly news magazine, damned the respective „anglicism“. What I am talking about here concerns the attitude to „sense“. In those times, a good 20 years ago, it was held to be impossible to say „dies macht Sinn“, engl. „this makes sense“. Speakers of German at that time understood the “make” as “to produce”. Instead, one was told, the correct phrase had to be „dies ergibt Sinn“, in a literal but impossible translation something like „this yields sense“, or even „dies hat Sinn“, in a literal but again wrong and impossible translation „this has sense“. These former ways of building a reference to the notion of „sense“ even feel awkward to many (most?) speakers of German today. Nowadays the English version of the phrase has replaced the old German one, and one can now even find the analogue of “making” sense in the „Spiegel“.
Well, the issue here is not just one of historical linguistics or one of style. The differences that we can observe here are deeply buried in the structure of the respective languages. It is hard to say whether such idioms in the German language are due to the history of German Idealism, or whether this particular philosophical stance developed on the basis of the structures in the language. Perhaps a bit of both, one could say from a Wittgensteinian point of view. Anyway, we may and can relate such differences in “contemporary” language to philosophical positions.
It is certainly by no means an exaggeration to conclude that cultures differ significantly in what their languages allow to be expressed. Such a thing as an “exact” translation is not possible beyond trivial texts or a use of language that is very close to physical action. Philosophically, we may assign a scale, or a measure, to describe the differences mentioned above in probabilistic terms, and this measure spans between pragmatism and idealism. This contrast also deeply influences philosophy itself. Any kind of philosophy comes in those two shades (at least), often denoted by the attributes „continental“ and „Anglo-American“. I think these labels just hide the relevant properties. This contrast of course applies to the reading of idealistic or pragmatic philosophers itself. It really makes a difference (1980s German: „it is a difference“) whether a native English-speaking philosopher reads Hegel, or a German native, whether a German native is reading Peirce or an American guy, whether Quine conducts research in logic or Carnap. The story quickly complicates if we take into consideration French philosophy and its relation to Heidegger, or the reading of modern French philosophers in contemporary German-speaking philosophy (which is almost completely absent).1
And it becomes even more complicated, if not complex and chaotic, if we consider the various scientific subcultures as particular forms of life, formed by and forming their own languages. In this way it may well seem to be rather impossible (at least, one feels tempted to think so) to understand Descartes, Leibniz, Aristotle, or even the pre-Socratics, not to speak of the Cro-Magnon culture2, although it is probably more appropriate to reframe the concept of understanding. After all, it may itself be infected by idealism.
In the chapters to come you may expect the following sections. As we did before we’ll try to go beyond the mere technical description, providing the historical trace and the wider conceptual frame:
 A Shift of Perspective
 Towards the Relational Perspective
 Positioning Transformation (again)
 The Abstract Perspective
 Revitalizing Punch Cards and Stacks
 Transforming Data
 Numerical Data: Numbers, just Numbers?
 From Strings to Orders to Numbers
A Shift of Perspective
Here, I need this reference to the relativity as it is introduced in, or by, language for highlighting a particular issue. The issue concerns a shift in preference, from the atom, the point, from matter, substance, essence and metaphysical independence towards the relation and its dynamic form, the transformation. This shift concerns some basic relationships of the weave that we call “Lebensform” (form of life), including the attitude towards those empiric issues that we will deal with in a technical manner later in this essay, namely the transformation of “data”. There are, of course, almost countless aspects of the topos of transformation, such as evolutionary theory, the issue of development, or, in the more abstract domains, mathematical category theory. In some way or another we already dealt with these earlier (for category theory, for evolutionary theory). These aspects of the concept of transformation will not play a role here.
In philosophical terms, the described difference between the German and the English language, and the change of the respective German idiom, marks the transition from idealism to pragmatism. This corresponds to the transition from a philosophy of primal identity to one where difference is transcendental. In the same vein, we could also set up the contrast between logical atomism and the event as philosophical topoi, or between favoring existential approaches and ontology against epistemology. Even more remarkably, we also find an opposing orientation regarding time. While idealism, materialism, positivism or existentialism (and all similar attitudes) are heading backwards in time, and only backwards, pragmatism and, more generally, a philosophy of events and transformation is heading forward, and only forward. It marks the difference between settlement (in Heideggerian terms „Fest-Stellen“, in English something like „fixing at a location“, putting something into the „Gestell“3) and anticipation. Settlements are reflected by laws of nature in which time does not, and shall not, play a significant role. All physical laws, and almost all theories in contemporary physics, are symmetric with respect to time. The “law perspective” blinds one to the concept of context, quite obviously so. Yet, being blind to context also makes it impossible to refer to information in an adequate manner.
In contrast, within a framework that is truly based on the primacy of interpretation and thus following the anticipatory paradigm, it does not make sense to talk about “laws”. Notably, issues like the “problem” of induction exist only in the framework of the static perspective of idealism and positivism.
It is important to understand that these attitudes are far from being just “academic” distinctions. There are profound effects to be found on the level of empiric activity: how data are handled, and using which kinds of methods. Furthermore, they can’t be “mixed” once one of them has been chosen. Although we may switch between them in a sequential manner, across time or across domains, we can’t practice them synchronously, as the whole setup of the life form is influenced. Of course, we do not want to rate one of them as the “best”; we just want to ensure that it is clear that there are particular consequences of that basic choice.
Towards the Relational Perspective
As late as 1991, Robert Rosen’s work on „Relational Biology“ was anything but obvious [1]. As a mathematician, Rosen was interested in the problem of finding a proper way to represent living systems by formal means. As a result of this research, he strongly proposed the “relational” perspective. He identifies Nicolas Rashevsky as its originator, who first mentioned it around 1935. It really sounds strange that relational biology had to be (re)invented. What else than relations could be important in biology? Yet, still today atomistic thinking is quite abundant; just think of the reductionist approaches in genetics (which fortunately have been seriously attacked meanwhile4). Or think about the still prevailing helplessness in various domains to conceive of complexity appropriately (see our discussion of this here). Being aware of relations means that the world is not conceived as made from items that are described by inputs and outputs with some analytics, or say deterministics, in between. Only of such items could it be said that they “function”. The relational perspective abolishes the possibility of the reduction of real “systems” to “functions”.
As is already indicated by the appearance of Rashevsky, there is, of course, a historical trace for this shift, a kind of soil emerging from intellectual sediments.5 While the 19th century could be considered as being characterized by the topos of population (of atoms), cf. the line from Laplace and Carnot to Darwin and Boltzmann, we can observe a growing awareness of the relation in the 20th century. Wittgenstein’s Tractatus started to oppose Frege and has always been in stark contrast to logical positivism; it was then accompanied by Zermelo (“axiom” of choice6), Rashevsky (relational biology), Turing (morphogenesis in complex systems), McLuhan (media theory), String Theory in physics, Foucault (field of propositions), and Deleuze (transcendental difference). Comparing Habermas and Luhmann on the one side (we may label their position as idealistic functionalism) with Sellars and Brandom on the other (who have been digging into the pragmatics of the relation as it is present in humans and their culture), we find the same kind of difference. We could also include Gestalt psychology as a kind of precursor to the party of “relationalists”, mathematical category theory (as opposed to set theory) and some strains from the behavioral sciences. Researchers like Ekman & Scherer (FACS), Kummer (sociality expressed as dynamics in relative positions), or Colmenares (play) focused on the relation itself, going far beyond the implicit reference to the relation as a secondary quality. We may add David Shane7 for architecture and Clarke or Latour8 for sociology. Of course, there are many, many other proponents who helped to grow the topos of the relation, yet, even without a detailed study we may guess that compared to the mainstream they still remain comparatively few.
These differences can hardly be overestimated in the field of information sciences, computer sciences, data analysis, or machine-based learning and episteme. It makes a great difference whether one bases the design of an architecture or the design of use on the concept of interfaces, most often defined as a location of full control, notably in both directions, or on the concept of behavioral surfaces.9 In the field of empiric activities, that is, modeling in its wide sense, it yields very different setups and consequences whether we start with the assumption of independence between our observables or between our observations, or whether we start with no assumptions about the dependency between observables, or observations, respectively. The latter is clearly the preferable choice in terms of intellectual soundness. Even if we stick to the first of both alternatives, we should NOT use methods that work only if that assumption is satisfied. (It is some kind of a mystery that people believe that doing so could be called science.) The reason is pretty simple. We do not know anything about the dependency structures in the data before we have finished modeling. It would inevitably result in a petitio principii if we put “independence” into the analysis, wrapped into the properties of methods. We would just find... guess what. After grinding facts, in the Wittgensteinian sense understood as relationalities, into empiristic dust we will not be able to find any meaningful relation at all.
Positioning Transformation (again)
Similarly, if we treat data as a “true” mapping of an outside “reality”, as “givens” that eventually are distorted a bit by more or less noise, we will never find multiplicity in the representations that we could derive from modeling, simply because it would contradict the prejudice. We also would not recognize all the possible roles of transformation in modeling. Measurement devices act as filters10, and as such they do not differ from any analytic transformation of the data. From the perspective of the associative part of modeling, where the data are mapped to desired outcomes or decisions, “raw” data are simply not distinguishable from “transformed” data, unless the treatment itself is encoded as data as well. Correspondingly, we may consider any data transformation by algorithmic means as an additional measurement device, which responds to particular qualities in the observations on its own. It is this equivalence that allows for the change from the linear to a circular and even a self-referential arrangement of empiric activities. Long-term adaptation, I would say even any adaptation at all, is based on such a circular arrangement. The only thing we had to change to earn the new possibilities was to drop the “passivist” representationalist realism11.
Usually, the transformation of data is considered as an issue that is a function of discernibility as an abstract property of data (people don’t talk like that, though; it’s our way of speaking here). Today, the respective aphorism as coined by Bateson has already become proverbial, despite its simplistic shape: Information is a difference that makes a difference. According to the context in which data are handled, this potential discernibility is addressed in different ways. Let us distinguish three such contexts: (i) Data warehousing, (ii) statistics, and (iii) learning as an epistemic activity.
In Data Warehousing one is usually faced with a large range of different data sources and data sinks, or consumers, where the difference between these sources and sinks simply relates to the different technologies and formats of databases. The warehousing tool should “transform” the data such that they can be used in the intended manner on the side of the sinks. The storage of the raw data as measured from the business processes and the efforts to provide any view onto these data have to satisfy two conditions (in the current paradigm). The storage has to be neutral (data should not be altered beyond the correction of obvious errors), and its performance, simply in terms of speed, has to be scalable, if not even independent of the data load. The activities in Data Warehousing are often circumscribed as “Extract, Transform, Load”, abbreviated ETL. There are many and large software solutions for this task, commercial ones and open source (e.g. Talend). The effect of DWH is to disclose the potential for an arbitrary and quickly served perspective onto the data, where “perspective” means just rearranged columns and records from the database. Except for cleaning and simple arithmetic operations, the individual bits of data themselves remain largely unchanged.
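To make the “neutral” character of ETL tangible, here is a minimal sketch in Python. The record layout, the function names and the use of a plain list as the sink are our assumptions for illustration only; they are not taken from Talend or any other product. Note that the transform step merely cleans obvious defects and rearranges columns, nothing more.

```python
import csv
import io

# Hypothetical raw export of a business process (assumed layout).
RAW = """id,amount,region
1, 10.5 ,north
2,,south
3,7.25,NORTH
"""

def extract(text):
    """Extract: read the raw records from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: neutral clean-up only (trim, unify case, keep gaps as None)
    plus a rearrangement of the columns for the sink."""
    cleaned = []
    for rec in records:
        amount = rec["amount"].strip()
        cleaned.append({
            "region": rec["region"].strip().lower(),
            "id": int(rec["id"]),
            "amount": float(amount) if amount else None,
        })
    return cleaned

def load(records, sink):
    """Load: append the records to the sink (a list standing in for a database)."""
    sink.extend(records)
    return sink

warehouse = load(transform(extract(RAW)), [])
```

The values themselves pass through untouched apart from the clean-up; the “perspective” offered to the sink is just a different arrangement of the same columns.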
In statistics, transformations are applied in order to satisfy the conditions for particular methods. In other words, the data are changed in order to enhance discernibility. Most popular is the log-transformation, which compresses large values and stretches small ones, and thereby shifts the mode of a distribution towards the larger values on the transformed scale. Two different small values that are located close to each other are separated better after a log-transformation; hence it is feasible to apply the log-transformation to data that form a right-skewed distribution, i.e. one where most values crowd together at the small end while a long tail extends towards large values. Other transformations aim at a particular scale or distribution, such as the z-score, or Fisher’s z-transformation. Interestingly, there is a further class of powerful transformations that is not conceived as such. Residuals are defined as the deviation of the data from a particular model. In linear regression a residual is the vertical distance of a data point from the regression line; it is the sum of the squares of these distances that the least-squares fit minimizes.
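The transformations just mentioned can be sketched in a few lines of Python. This is an illustration under our own assumptions (function names, toy numbers), using only the standard library; it is not meant as a reference implementation of any statistics package.

```python
import math
import statistics

def log_transform(values):
    """Natural logarithm of strictly positive values."""
    return [math.log(v) for v in values]

def z_scores(values):
    """Standardize to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def fisher_z(r):
    """Fisher's z-transformation of a correlation coefficient, -1 < r < 1."""
    return math.atanh(r)

def residuals(xs, ys):
    """Signed deviations of ys from the least-squares line fitted to (xs, ys)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Toy data: two small neighbours and two large neighbours.
skewed = [1.0, 2.0, 1000.0, 1001.0]
logged = log_transform(skewed)
gap_small = logged[1] - logged[0]  # distance between 1 and 2 after the log
gap_large = logged[3] - logged[2]  # distance between 1000 and 1001 after the log

# Residuals of a toy regression; they are themselves a new, derived variable.
res = residuals([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
```

After the log-transformation the two small values are far better separated than the two large ones, although both pairs were one unit apart on the original scale. The list `res` shows in which sense residuals are a transformation: they form a data column that did not exist before the model was applied.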
The concept, however, can be extended to those data which do not “follow” the investigated model. The analysis of residuals has two aspects, a formal one and an informal one. Formally, it is used as a complex test whether the investigated model does fit or whether it does not. The residuals should not show any evident “structure”. That’s it. There is no institutional way back to the level of the investigated model; there are no rules about that which could be negotiated in a yet-to-be-established community. The statistical framework is a linear one, which could be seen as a heritage from positivism. It is explicitly forbidden to “optimize” a correlation by multiple actualization. Yet, informally the residuals may give hints on how to change the basic idea as represented by the model. Here we find a circular setup, where the strategy is to remove any rule-based regularity, i.e. discernibility, from the data.
The effect of this circular arrangement takes place entirely in the practicing human, as a kind of refinement. It can’t be found anywhere in the methodological procedure itself in a rule-based form. This brings us to the third area, epistemic learning.
In epistemic learning, any of the potentially significant signals should be rendered in such a way as to allow for an optimized mapping towards a registered outcome. Such outcomes often come as dual (binary) values, or as a small group of ordinal values in the case of multi-constraint, multi-target optimization. In epistemic learning we thus find the separation of transformation and association in its most prominent form, despite the fact that data warehousing and statistics are also intended to be used for enhancing decisions. Yet, their linearity simply does not allow for any kind of institutionalized learning.
This arbitrary restriction to the linear methodological approach in formal epistemic activities results in two related and quite unfavorable effects: first, the shamanism of “data exploration”, and second, the infamous hell of methods. One can indeed find thousands, if not tens of thousands, of research or engineering articles trying to justify a particular new method as the most appropriate one for a particular purpose. These methods themselves, however, are never identified as a „transformation“. Authors are all struggling for the “best” method, while the whole community neglects the possibility, and the potential, of combining different methods after shaping them as transformations.
The laborious and never-ending training necessary to choose from the huge amount of possible methods then is called methodology… The situation is almost paradoxical. First, the methods are claimed to tell something about the world, although this is not possible at all, and not just because those methods are analytic. It is an idealistic hope, which has been abolished already by Hume. Above all, only analytic methods are considered to be scientific. Then, through the large population of methods, the choice of a particular one becomes aleatory, which renders the whole activity a deeply non-scientific one. Additionally, it is governed by the features of some software, or the skills of the user of such software, not by a conceptual stance.
Now remember that any method is also a specific filter. Obviously, nothing can be known about the benefit of a particular method before the prediction that is based on the respective model has been validated. This simple insight renders “data exploration” meaningless. It can only play its role within linear empirical frameworks, which are inappropriate anyway. Data exploration is suggested to be done “intuitively”, often using methods of visualization. Yet, those methods are severely restricted with regard to the graspable dimensionality. More than 6 to 8 dimensions can’t be “visualized” at once. Compare this to the 2^n (n: number of variables) possible models and you immediately see the problem. Besides, the only effect of visualization is just a primitive form of clustering. Additionally, visual inputs are images, above all, and as images they can’t play a well-defined epistemological role.12
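The arithmetic behind this comparison is quickly checked. The sketch below assumes, as a simplification of ours, that a “model” is identified with the subset of variables it uses; the function names are hypothetical.

```python
from itertools import combinations

def count_variable_subsets(n):
    """Number of distinct subsets of n variables, i.e. 2^n candidate models."""
    return 2 ** n

def enumerate_subsets(variables):
    """Yield every subset of the given variables, from the empty set upwards."""
    for k in range(len(variables) + 1):
        for subset in combinations(variables, k):
            yield subset

# Subsets for a "visualizable" number of variables and for a modest table.
visualizable = count_variable_subsets(8)
modest_table = count_variable_subsets(30)
all_three = list(enumerate_subsets(["a", "b", "c"]))
```

With 8 variables there are 256 subsets, with 30 variables already more than a billion, so looking at pictures of a handful of dimensions covers a vanishing fraction of the possible models.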
Complementary to the non-concept of “exploring” data13, and equally misconceived, is the notion of “preparing” data. At least, it must be rated as misconceived as far as it comprises transformations beyond error correction and arranging data into tables. The reason is the same: We can’t know whether a particular “cleansing” will enhance the predictive power of the model, in other words, whether it comprises potential information that supports the intended discernibility, before the model has been built. There is no possibility to decide which variables to include before having finished the modeling. In some contexts the information accessible through a particular variable could be relevant or even important. Yet, if we conceive of transformations as preliminary hypotheses we can’t call them “preparation” any more. “Preparation” for what? For proving the petitio principii? Certainly the peak of all preparatory nonsense is the “imputation” of missing values.
Dorian Pyle [11] calls such introduced variables “pseudo variables”, others call them “latent” or even “hidden variables”.14 Any of these labels is inappropriate, since the transformation is nothing else than a measurement device. Introduced variables are just variables, nothing else.
Indeed, these labels are reliable markers: whenever you meet a book or article dealing with data exploration, data preparation, the “problem” of selecting a method, or likewise, selecting an architecture within a meta-method like Artificial Neural Networks, you can know for sure that the author is not really interested in learning and reliable predictions. (Or that he or she is not able to distinguish analysis from construction.)
In epistemic learning the handling of residuals is somewhat inverse to their treatment in statistics, again as a result of the conceptual difference between the linear and the circular approach. In statistics one tries to prove that the model, say: transformation, removes all the structure from the data such that the remaining variation is pure white noise. Unfortunately, there are two drawbacks with this. First, one has to define the model before removing the noise and before checking the predictive power. Secondly, the test for any possibly remaining structure again takes place within the atomistic framework.
In learning we are interested in the opposite. We are looking for such transformations which remove the noise in a multivariate manner such that the signal-to-noise ratio is strongly enhanced, perhaps even to the proto-symbolic level. Only after the denoising due to the learning process, that is, after a successful validation of the predictive model, is the structure described for the (almost) noise-free data segment15 as an expression that is complementary to the predictive model.
In our opinion an appropriate approach would actualize as an instance of epistemic learning that is characterized by
 – conceiving any method as transformation;
 – conceiving measurement as an instance of transformation;
 – conceiving any kind of transformation as a hypothesis about the “space of expressibility” (see next section), or, similarly, the finally selected model;
 – the separation of transformation and association;
 – the circular arrangement of transformation and association.
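The five points above can be condensed into a small Python sketch. Everything in it is an assumption made for illustration: the toy data, the two candidate transformations, the use of a simple threshold classifier (a “stump”) as the associative part, and the even/odd split for validation. It is not the procedure of any particular package. The point is only the arrangement: a transformation is proposed as a hypothesis, the association is rebuilt, the prediction is validated, and only then is the transformation kept or dropped.

```python
import statistics

def square(col):
    """Candidate transformation: square every value."""
    return [v * v for v in col]

def center_abs(col):
    """Candidate transformation: distance of every value from the column mean."""
    m = statistics.mean(col)
    return [abs(v - m) for v in col]

def fit_stump(xs, ys):
    """Association: threshold and direction with the best training accuracy."""
    best = (0.0, None, None)
    for t in sorted(set(xs)):
        for sign in (1, -1):
            preds = [1 if sign * (x - t) >= 0 else 0 for x in xs]
            acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
            if acc > best[0]:
                best = (acc, t, sign)
    return best[1], best[2]

def accuracy(model, xs, ys):
    """Share of correct predictions of a fitted stump."""
    t, sign = model
    preds = [1 if sign * (x - t) >= 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def validate(stack, xs, ys):
    """Apply the stack, fit on even-indexed rows, validate on odd-indexed rows."""
    col = xs
    for card in stack:
        col = card(col)
    model = fit_stump(col[0::2], ys[0::2])
    return accuracy(model, col[1::2], ys[1::2])

def circular_search(xs, ys, candidates, max_rounds=5):
    """Keep adding the transformation that most improves validated prediction."""
    stack, best = [], validate([], xs, ys)
    for _ in range(max_rounds):
        scored = [(validate(stack + [card], xs, ys), card) for card in candidates]
        score, card = max(scored, key=lambda s: s[0])
        if score <= best:
            break  # no hypothesis improved the validated prediction
        stack.append(card)
        best = score
    return [card.__name__ for card in stack], best

# Toy data: the outcome is 1 only inside a band, so no single threshold on x works.
XS = [float(x) for x in range(20)]
YS = [1 if 6 <= x <= 13 else 0 for x in range(20)]

baseline = validate([], XS, YS)
chosen, score = circular_search(XS, YS, [square, center_abs])
```

On the untransformed column the stump reaches a validated accuracy of only 0.7; after the loop has selected `center_abs`, the same simple association separates the two groups completely. Whether a transformation was “useful” is decided exclusively by the validated prediction, never beforehand.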
The Abstract Perspective
We now have to take a brief look at the mechanics of transformations in the domain of epistemic activities.16 For doing this, we need a proper perspective. As such we choose the notion of space. Yet, we would like to emphasize that this space is not necessarily Euclidean, i.e. flat, or open, like the Cartesian space, i.e. with quantities running to infinity. Moreover, dimensions need not be thought of as being “independent”, i.e. orthogonal to each other. Distance measures need to be defined only locally, yet without implying ideal continuity. There might be a certain kind of “graininess” defined by a distance D, below which the space is not defined. The space may even contain “bubbles” of lower dimensionality. So, it is indeed a very general notion of “space”.
Observations shall be represented as “points” in this space. Since these “points” are not independent of the efforts of the observer, these points are not dimensionless. To put it more precisely, they are like small “clouds” that are best described as probability densities for “finding” a particular observation. Of course, this “finding” is a kind of inextricable mixture of “finding” and “constructing”. It does not make much sense to distinguish both on the level of such cloudy points. Note that the cloudiness is not a problem of accuracy in measurement! A posteriori, that is, subsequent to introducing an irreversible move17, such a cloud could also be interpreted as an open set of the provoked observation and virtual observations. It should be clear by now that such a concept of space is very different from the Euclidean space that nowadays serves as a base concept for any statistics or data mining. If you think that conceiving such a space is not necessary or even nonsense, then think about quantum physics. In quantum physics we are also faced with the breakdown of the separation of observer and observable, and physicists ended up quite precisely in spaces as we described them above. These spaces then are handled by various renormalization methods.18 In contrast to the abstract yet still physical space of quantum theory, our space need not even contain an “origin”. Elsewhere we called such a space aspectional space.
Now let us take the important step of becoming interested in only a subset of these observations. Assume we want to select a very particular set of observations (they are still clouds of probabilities, made from virtual observations) by means of prediction. This selection can now be conceived in two different ways. The first way is the one that is commonly applied and consists of the reconstruction of a “path”. Since in the contemporary epistemic life form of “data analysts” Cartesian spaces are used almost exclusively, all these selection paths start from the origin of the coordinate system. The endpoint of the path is the point of interest, the “outcome” that should be predicted. As a result, one first gets a mapping function from predictor variables to the outcome variable. All possible mappings form the space of mappings, which is a category in the mathematical sense.
The alternative view does not construct such a path within a fixed coordinate system, i.e. within a space with fixed properties. Quite on the contrary, the space itself gets warped and transformed until very simple figures appear, which represent the various subsets of observations according to the focused quality.
Imagine an ordinary, small, blown-up balloon. Next, imagine a grid in the space enclosed by the balloon’s hull, made of very thin threads. These threads shall represent the space itself. Of course, in our example the space is 3d, but it is not limited to this case. Now think of two kinds of small pearls attached to the threads all over the grid inside the balloon, blue ones and red ones. It shall be the red ones in which we are interested. The question now is what can we do to separate the blue ones from the red ones?
The way to proceed is pretty obvious, though the solution itself may be difficult to achieve. What we can try is to warp and to twist, to stretch, to wring and to fold the balloon in such a way that the blue pearls and the red pearls separate as nicely as possible. In order to purify the groups we may even consider compressing some regions of the space inside the balloon such that they turn into singularities. After all this work (and beware, it is hard work!) we introduce a new grid of threads into the distorted space and dissolve the old ones. All pearls automatically attach to the closest threads, stabilizing the new space. Again, conceiving of such a space may seem weird, but again we can find a close relative in physics, the Einsteinian spacetime. Gravitation effectively is warping that space, though in a continuous manner. There are famous empirical proofs of that warping of physical spacetime.19
Analytically, these two perspectives, the path reconstruction on the one hand and the space warping on the other, are (almost) equivalent. The perspective of space warping, however, offers a benefit that is not to be underestimated. We arrive at a new space for which we can define its own properties and in which we can again define measures that are different from those possible in the original space. The path reconstruction does not offer such a “derived space”. Hence, once the path is reconstructed, the story stops. It is a linear story. Our proposal thus is to change perspective.
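A toy version of the pearls in the balloon can be written down in a few lines of Python. The setup is our own assumption for illustration: red pearls sit on a small ring, blue pearls on a larger ring around the same center, and the warp is the familiar change to polar coordinates. Any other warp would serve the argument equally well.

```python
import math

def ring(radius, n=12):
    """n points evenly spread on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def warp(point):
    """Warp the plane: replace (x, y) by (radius, angle)."""
    x, y = point
    return (math.hypot(x, y), math.atan2(y, x))

def separable_by_threshold(a_vals, b_vals):
    """True if a single cut on this coordinate separates the two groups."""
    return max(a_vals) < min(b_vals) or max(b_vals) < min(a_vals)

red = ring(1.0)    # the pearls we are interested in
blue = ring(3.0)   # the others, surrounding the red ones

# In the original space neither coordinate separates the groups.
sep_x = separable_by_threshold([p[0] for p in red], [p[0] for p in blue])
sep_y = separable_by_threshold([p[1] for p in red], [p[1] for p in blue])

# In the warped space a simple figure appears: a single cut on the radius.
red_w = [warp(p) for p in red]
blue_w = [warp(p) for p in blue]
sep_r = separable_by_threshold([p[0] for p in red_w], [p[0] for p in blue_w])
```

In the original coordinates no single cut on x or y separates red from blue, whereas in the warped space a cut anywhere between radius 1 and 3 does. The derived space also carries its own measures, here the difference in radius, which had no simple counterpart before the warp.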
Warping the space of measurability and expressibility is an operation that inverts the generation of cusp catastrophes20 (see Figure 1 below). Thus it transcends the cusp catastrophes. In the perspective of path reconstruction one has to avoid the phenomenon of hysteresis and cusps altogether, hence losing a lot of information about the observed source of data.
In the Cartesian space and the path reconstruction methodology related to it, all operations are analytic, that is, organized as symbolic rewriting. The reason for this is the necessity for the paths to remain continuous and closed. In contrast, space warping can be applied locally. Warping spaces in dealing with data is not an exotic or rare activity at all. It happens all the time. We know it even from (simple) mathematics, when we define different functions, including the empty function, for different sets of input parameter values.
The main consequence of changing the perspective from path reconstruction to space warping is an enlargement of the set of possible expressions. We can do more without the need to call it “heuristics”. Our guess is that any serious theory of data and measurement must follow the opened route of space warping, if this theory of data tries to avoid positivistic reductionism. Most likely, such a theory will be kind of a renormalization theory in a connected, relativistic data space.
Revitalizing Punch Cards and Stacks
In this section we will introduce the outline of a tool that allows one to follow the circular approach in epistemic activities. Basically, this tool is about organizing arbitrary transformations. While there are expression interpreters for analytic (mathematical) expressions, it is also clear that analytic expressions form only a subset of the set of all possible transformations, even if we consider the fact that many expression interpreters have grown into some kind of programming or scripting language. Indeed, Java contains an interpreting engine for JavaScript by default, and there are several quite popular ones for mathematical purposes. One could also conceive of mathematical packages like Octave (open source), MatLab or Mathematica (both commercial) as such expression interpreters, even though their most recent versions can do much, much more. Yet MatLab & Co. are not quite suitable as a platform for general-purpose data transformation.
The structural metaphor that has proved to be as powerful as it is sustainable, for more than 10 years now, is the combination of the workbench with the punch card stack.
Image 1: A Punched Card for feeding data into a computer
Any particular method, mathematical expression or arbitrary computational procedure resulting in a transformation of the original data is conceived as a “punch card”. This provides a proper modularization, and hence standardization. Actually, the role of these “functional compartments” is extremely standardized, at least enough to define an interface for plugins. Like the ancient punch cards made from paper, each card represents a more or less fixed functionality. Of course, this functionality may be defined by a plugin that itself connects to Matlab…
Moreover, again like the ancient punch cards, the virtualized versions can be stacked. For instance, we first put the treatment for missing values onto the stack, simply to ensure that all NULLS are written as 1. The next card then determines minimum and maximum in order to provide the data for linear normalization, i.e. the mapping of all values into the interval [0..1]. Then we add a card for compressing the “fat tail” of the distribution of values in a particular variable. Alternatively we may use a card to split the “fat tail” off into a new variable! Finally we apply the card (i.e. the plugin) for normalizing the data to the original and the new data column.
I think you got the idea. Such a stack is not only maintained for each of the variables, it is also created on the fly according to the needs as these are detected by simple rules. You may also think of the cards as the set of rules that describe the capabilities of agents, which constantly check the data for whether they can apply their rules. You may also think of these stacks as a device that works like a tailored distillation column, as it is used for fractional distillation in petrochemistry.
Image 2: Some industrial fractional distillation columns for processing mineral oil. Dependent on the number of distillation steps different products result.
These stacks of parameterized procedures and expressions represent a generally programmable computer, or more precisely, operating system, quite similar to a spreadsheet, albeit the purpose of the latter, and hence the functionality, actualizes in a different form. The whole thing may even be realized as a language! In this case, one would not need a graphical user interface anymore.
The effect of organizing the transformation of data in this way, by means of plugins that follow the metaphor of the “punch card stack”, is dramatic. Introducing transformations and testing them can be automated. At this point we should mention the natural ally of the transformation workbench, the maximum likelihood estimation of the most promising transformations that combine just two or three variables into a new one. All three parts, the transformation stack engine, the dependency explorer, and the evolutionarily optimized associative engine (which is able to create a preference weighting for the variables), can be put together in such a way that finding the “optimal” model can be run in a fully automated manner. (Meanwhile the SomFluid package has grown into a stage where it can accomplish this. . . download it here, but you still need some technical expertise to get it running.)
The approach of the “transformation stack engine” is not just applicable to tabular data, of course. Given a set of proper plugins, it can be used as a digester for large sets of images or time series as well (see below).
Transforming Data
In this section we now will take a more practical and pragmatic perspective. Actually, we will describe some of the most useful transformations, including their parameters. We do so, because even prominent books about “data mining” have been handling the issue of transforming data in a mistaken or at least seriously misleading manner.21,22
If we consider the goal of the transformation of numerical data, namely increasing the discernibility of assignated observations, we will recognize that we can identify a rather limited number of types of such transformations, even if we consider the space of possible analytic functions that combine two (or three) variables.
We will organize the discussion of the transformations into three subsections, whose subjects are of increasing complexity. Hence, we will start with the (ordinary) table of data.
Tabular Data
Tables may comprise numerical data or strings of characters. In their general form they may even contain whole texts, a complete book in any of the cells of a column (but see the section about unstructured data below!). If we want to access the information carried by the string data, we sooner rather than later have to translate them into numbers. Unlike numbers, string data, and the relations between data points made from string data, must be interpreted. As a consequence, there are always several, if not many, different possibilities for that representation. Besides referring to the actual semantics of the strings, which could be expressed by means of the indices of some preference orders, there are also two important techniques of automatic scaling available, which we will describe below.
Besides string data, dates are a further multidimensional category of data. A date encodes not only a serial number relative to some (almost) arbitrarily chosen base date, which we can use to express the age of the item represented by the observation. We also have, of course, day of week, day of month, number of week, number of month, and, not to forget, the season as an approximate class. It depends a bit on the domain whether these aspects play any role at all. Yet, if we think about the rhythms in the city or on the stock markets across the week, or the “black Monday/Tuesday/Friday effect” in production plants or hospitals, then it is clear that we usually have to represent the single date value by several “informational extracts”.
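The following sketch shows how a single date value could be expanded into several such “informational extracts”. The function name `date_extracts`, the base date and the coding of the season (meteorological seasons of the northern hemisphere) are our own assumptions, chosen only for illustration.

```python
from datetime import date

def date_extracts(d: date, base: date = date(2000, 1, 1)) -> dict:
    """Expand one date into several numerical aspects."""
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "serial": (d - base).days,       # age relative to the (arbitrary) base date
        "day_of_week": iso_weekday,      # 1 = Monday ... 7 = Sunday
        "day_of_month": d.day,
        "week_of_year": iso_week,        # ISO week number
        "month": d.month,
        # 0 = winter (Dec-Feb), 1 = spring, 2 = summer, 3 = autumn
        "season": (d.month % 12) // 3,
    }

extracts = date_extracts(date(2012, 5, 17))
```

Each of the resulting entries becomes a variable of its own, so that the modeling procedure can decide which of the aspects actually carries information in the given domain.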
A last class of data types that we have to distinguish are time values. We already mentioned the periodicity in other aspects of the calendar. In which pair of time values do we find the closer similarity, T1 (23:41, 0:05) or T2 (8:58, 15:17)? A naive distance measure evaluates the values of T2 as much more similar than those of T1, although the two values of T1 are just 24 minutes apart. What we have to do is to set a flag for “circularity” in order to calculate the time distances correctly.
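A minimal sketch of such a “circularity” flag, using the two pairs of time values from the example; the function names `minutes` and `time_distance` are hypothetical.

```python
def minutes(hhmm: str) -> int:
    """Convert a 24h time string like '23:41' into minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)

def time_distance(a: str, b: str, circular: bool = True) -> int:
    """Distance between two times of day in minutes.

    With circular=True the shorter way around the 24h clock is taken.
    """
    d = abs(minutes(a) - minutes(b))
    if circular:
        d = min(d, 24 * 60 - d)
    return d

t1_naive = time_distance("23:41", "0:05", circular=False)   # 1416 minutes
t1_circ = time_distance("23:41", "0:05")                    # 24 minutes
t2_circ = time_distance("8:58", "15:17")                    # 379 minutes
```

Only with the flag set does T1 come out as the much more similar pair.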
Numerical Data: Numbers, just Numbers?
Numerical data are data for which in principle any value from within a particular interval could be observed. If such data are symmetrically and normally distributed, then we have little reason to guess that there is something interesting within this sample of values. As soon as the distribution becomes asymmetrical, it starts to become interesting. We may observe “fat tails” (large values are “overrepresented”), or multimodal distributions. In both cases we could suspect that there are at least two different processes, one dominating the other differentially across peaks. So we should split the variable into two (called “deciling”) and, ceteris paribus, check the effect on the predictive power of the model. Typically one splits the values at the minimum between the peaks, but it is also possible to implement an overlap, where some records are present in both of the new variables.
Long tails indicate some aberrant behavior of the items represented by the respective records, or, as in medicine, even pathological contexts. Strongly left-skewed distributions often indicate organizational or institutional influences. Here we could compress the long tail by a log-shift and then split the variable, that is, decile it into two.21
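The two steps, compressing the tail by a log-shift and splitting the variable into two, may be sketched as follows. The sketch assumes that the long tail lies on the side of the large values; the names `log_shift` and `split_at` as well as the threshold are hypothetical, and in practice the threshold would be taken from the distribution, for instance at the minimum between two peaks.

```python
import math
from typing import List, Optional, Tuple

def log_shift(values: List[float]) -> List[float]:
    """Shift the values so that the minimum maps to 1, then take the log."""
    lo = min(values)
    return [math.log(v - lo + 1.0) for v in values]

def split_at(values: List[float], threshold: float
             ) -> Tuple[List[Optional[float]], List[Optional[float]]]:
    """Split one variable into two; a record is missing (None) in the
    variable it does not belong to."""
    low = [v if v <= threshold else None for v in values]
    high = [v if v > threshold else None for v in values]
    return low, high

raw = [1.0, 2.0, 3.0, 4.0, 1000.0]        # hypothetical values with a long tail
shifted = log_shift(raw)                  # the tail value 1000 shrinks to about 6.9
low, high = split_at(shifted, threshold=3.0)
```

Both new variables are then treated as hypotheses, whose contribution to the predictive power of the model has to be checked.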
In some domains, like finance, we find special values at which symmetry breaks. For ordinary money values the 0 is such a value. We know in advance that we have to split the variable into two, because the semantic and the structural difference between +50$ and -75$ is much bigger than between 150$ and 2500$… probably. As always, we transform it such that we create additional variables as a kind of hypothesis, for which we have to evaluate their (positive) contribution to the predictive power of the model.
In finance, but also in medicine, and more generally in any system that is able to develop metastable regions, we have to expect such points (or regions) with increased probability of breaking symmetry and hence strong semantic or structural difference. René Thom first described similar phenomena by his theory that he labeled “catastrophe theory”. In 3D you can easily think of cusp catastrophes as a hysteresis in the x-z direction that is, however, gradually smoothed out in the y-direction.
Figure 1: Visualization of folds in parameters space, leading to catastrophes and hystereses.
In finance we are faced with a whole culture of rule following. The majority of market analysts use the same tools, for instance “stochasticity,” or a particularly parameterized MACD for deriving “signals”, that is, indicators for points of action. The financial industries have been hiring a lot of physicists, and this population largely sticks to the same mathematics, such as GARCH combined with Monte Carlo simulations. Approaches like fractal geometry are still regarded as exotic.23
Or think about option prices, where we find several symmetry breaks by means of contract. These points have to be represented adequately in dedicated, that is, derived variables. Again, we can’t emphasize it enough, we HAVE to do so as a kind of performed hypothesizing. The transformation of data by creating new variables is, so to speak, the low-level operationalization of what later may grow into a scientific hypothesis. Creating new variables poses serious problems for most methods, which may count as a reason why many people don’t follow this approach. Yet, for our approach it is not a problem, definitely not.
In medicine we often find “norm values”. Potassium in blood serum may take any value within a particular range without reflecting any physiologic problem. . . if the person is healthy. If there are other risk factors the story may be a different one. The ratio of potassium and glucose in serum provides us with an example of a significant marker. . . if the person already has heart problems. By means of such risk markers we can introduce domain-specific knowledge. And that’s actually a good message, since we can identify our own “markers” and represent them as transformations. The consequence is pretty clear: a system that is supposed to “learn” needs a suitable repository for storing and handling such markers, represented as a relational system (graph).
Let us return to the norm ranges briefly again. A small difference outside the norm range could be rated much more strongly than within the norm range. This may lead to the weight functions shown in the next figure, or more or less similar ones. For a certain range of input values, the norm range, we leave the values unchanged. The output weight equals 1. Outside of this range we transform them in a way that emphasizes the difference to the respective boundary value of the norm range. This could be done in different ways.
Figure 2: Examples for output weight configurations in norm-range transformation
Actually, this rationale of the norm range can be applied to any numerical data. As an estimate of the norm range one could use the 80% quantile, centered around the median and realized as +/-40% quantiles. On the level of model selection, this will result in a particular sensitivity for multidimensional outliers, notably before defining any criterion a priori of what an outlier should be.
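A possible reading of the norm-range transformation is sketched below. The norm range is estimated as the central 80% of the sample, values inside the range remain unchanged, and outside of it the distance to the respective boundary is stretched linearly. The linear stretching and the parameter `gain` are just one of the “different ways” mentioned above and are our assumption, as are the function names.

```python
import statistics
from typing import List, Tuple

def norm_range(values: List[float]) -> Tuple[float, float]:
    """Estimate the norm range as the span between the 10% and 90% quantiles."""
    deciles = statistics.quantiles(values, n=10)   # returns 9 cut points
    return deciles[0], deciles[-1]

def emphasize_outside(v: float, lo: float, hi: float, gain: float = 3.0) -> float:
    """Leave values inside [lo, hi] unchanged; outside, stretch the
    difference to the nearest boundary by the factor `gain`."""
    if v < lo:
        return lo - gain * (lo - v)
    if v > hi:
        return hi + gain * (v - hi)
    return v

sample = [float(x) for x in range(1, 101)]         # hypothetical measurements
lo, hi = norm_range(sample)
transformed = [emphasize_outside(v, lo, hi) for v in sample]
```

With hypothetical boundaries of 3.5 and 5.0, a value of 4.0 stays at 4.0, while 5.5 is moved to 6.5 and 3.0 to 2.0.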
From Strings to Orders to Numbers
Many data come as some kind of description or label. Such data are described as nominal data. Think for instance about prescribed drugs in a group of patients included in an investigation of risk factors for a disease, or think about the name or the type of restaurants in an urbanological/urbanistic investigation. Nominal data are quite frequent in behavioral, organizational or social data, that is, in contexts that are established mainly on a symbolic level.
Performing measurements only on the nominal scale should be avoided, yet sometimes it is not possible to circumvent it. It can be avoided at least partially by including further properties that can be represented by numerical values. For instance, instead of using only the names of cities in a data set, one can use the geographical location or the number of inhabitants, or, when referring to places within a city, one can use descriptors that cover some properties of the respective area, such as density of traffic, distance to similar locations, price level of consumer goods, economical structure etc. If a direct measurement is not possible, estimates can do the job as well, if the certainty of the estimate is expressed. The certainty can then be used to generate surrogate data. If the fine-grained measurement creates further nominal variables, they could be combined to form a scale. Such enrichment is almost always possible, irrespective of the domain. One should keep in mind, however, that any such enrichment is nothing else than a hypothesis.
Sometimes, data on the nominal level, technically a string of alphanumerical characters, already contain valuable information. For instance, they may contain numerical values, as in the names of cars. If we deal with things like names of molecules, where these names often come as compounds, reflecting the fact that molecules themselves are compounds, we can calculate the distance of each name to a virtual “average name” by applying a technique called “random graph”. Of course, in the case of molecules we would have a lot of properties available that can be expressed as numerical values.
Ordinal data are closely related to nominal data. Essentially, there are two flavors of them. In the case of the least valuable of them the numbers do not express a numerical value; the cipher is just used as a kind of letter, indicating that there is a set of sortable items. Sometimes, values of an ordinal scale represent some kind of similarity. Although this variant is more valuable, it can still be misleading, because the similarity may not scale isodistantly with the numerical values of the ciphers. Undeniably, there is still a remainder of a “name” in it.
We are now going to describe some transformations to deal with data from low-level scales.
The least action we have to apply to nominal data is a basic form of encoding. We use integer values instead of the names. The next, though only slightly better, level would be to reflect the frequency of the encoded item in the ordinal value. One would, for instance, not encode the name into an arbitrary integer value, but into the log of the frequency. A much better alternative, however, is provided by the descendants of correspondence analysis. These are called Optimal Scaling and the Relative Risk Weight. The drawback of these methods is that some information about the predicted variable is necessary. In the context of modeling, by which we always understand target-oriented modeling (as opposed to associative storage24), we usually find such information, so the drawback is not too severe.
First to optimal scaling (OSC). Imagine a variable, or “assignate” as we prefer to call it25, which is scaled on the nominal or the low ordinal scale. Let us assume that there are just three different names or values. As already mentioned, we assume that a purpose has been selected and hence a target variable as its operationalization is available. Then we could set up the following table (the figures are denoting frequencies).
Table 1: Summary table derived from a hypothetical example data set. av(i) denote three nominally scaled assignates.
| outcome_{tv} | av_{1} | av_{2} | av_{3} | marginal sum |
|---|---|---|---|---|
| ta | 140 | 120 | 160 | 420 |
| tf (focused) | 30 | 10 | 40 | 80 |
| marginal sum | 170 | 130 | 200 | 500 |
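From these figures we can calculate the new scale values as the share of the focused outcome in the column total of each assignate, which is the formula that reproduces the values given in Table 2:

$$ OSC(av_{i}) = \frac{tf(av_{i})}{ta(av_{i}) + tf(av_{i})} $$

For the assignate av_{1} this yields

$$ OSC(av_{1}) = \frac{30}{140 + 30} = \frac{30}{170} \approx 0.176 $$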
Table 2: Here, various encodings are contrasted.
| assignate | literal encoding | frequency | normalized log(freq) | optimal scaling | normalized OSC |
|---|---|---|---|---|---|
| av_{1} | 1 | 170 | 0.62 | 0.176 | 0.809 |
| av_{2} | 2 | 130 | 0.0 | 0.077 | 0.0 |
| av_{3} | 3 | 200 | 1.0 | 0.200 | 1.0 |
Using these values we could replace any occurrence of the original nominal (ordinal) values by the scaled values. Alternatively—or better additionally—, we could sum up all values for each observation (record), thereby collapsing the nominally scaled assignates into a single numerically scaled one.
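The encodings contrasted in Table 2 can be recomputed from the frequencies of Table 1 with a few lines of Python. The sketch assumes that the optimal scaling value is the share of the focused outcome tf in the column total, and that “normalized” means a linear min-max mapping into [0, 1]; both assumptions are inferred from the figures in the two tables. The variable names are hypothetical.

```python
import math

# Frequencies from Table 1
ta = {"av1": 140, "av2": 120, "av3": 160}   # remaining outcomes
tf = {"av1": 30, "av2": 10, "av3": 40}      # focused outcome

def min_max(scores: dict) -> dict:
    """Linear normalization of the score values into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

# Column marginal sums: 170, 130, 200
freq = {k: ta[k] + tf[k] for k in ta}

# Frequency-based encoding: normalized log of the frequency
log_freq = min_max({k: math.log(f) for k, f in freq.items()})

# Optimal scaling: share of the focused outcome per assignate
osc = {k: tf[k] / freq[k] for k in ta}
osc_norm = min_max(osc)
```

Note that the frequency-based encoding does not use the target variable at all, whereas the optimal scaling does, which is the reason for the different score of av_{1} in the two normalized columns.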
Now we will describe the RRW. Imagine a set of observations {o(i)} where each observation is described by a set of assignates a(i). Also let us assume that some of these assignates are on the binary level, that is, the presence of this quality in the observation is encoded by “1”, its absence by “0”. This usually results in sparsely filled regions of the data table, or even a sparsely filled table as a whole. Depending on the size of the “alphabet”, even more than 99.9% of all values could simply be equal to 0. Such data cannot be grouped in a reasonable manner. Additionally, if there are further assignates in the table that are not binary encoded, the information in the binary variables would be neglected almost completely without applying a rescaling like the RRW.
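The RRW relates the relative frequency of an assignate within the focused outcome to its relative frequency within the remaining outcomes. The raw ratio is then damped logarithmically; the logarithmic form given here is the one that reproduces the values in Table 3:

$$ RRW_{raw}(av_{i}) = \frac{tf(av_{i}) / \sum_{j} tf(av_{j})}{ta(av_{i}) / \sum_{j} ta(av_{j})}, \qquad RRW(av_{i}) = \log_{10}\left(1 + RRW_{raw}(av_{i})\right) $$

For the assignate av_{1} this yields

$$ RRW_{raw}(av_{1}) = \frac{30/80}{140/420} = 1.125, \qquad RRW(av_{1}) = \log_{10}(2.125) \approx 0.33 $$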
As you can see, the RRW uses the marginal from the rows, while the optimal scaling uses the marginal from the columns. Thus, the RRW uses slightly more information. Assuming a table made from binary assignates av(i), which could be summarized into table 1 above, the formula yields the following RRW factors for the three binary scaled assignates:
Table 3: Relative Risk Weights (RRW) for the frequency data shown in table 1.
| Assignate | raw RRW_{i} | RRW_{i} | normalized RRW |
|---|---|---|---|
| av_{1} | 1.13 | 0.33 | 0.82 |
| av_{2} | 0.44 | 0.16 | 0.00 |
| av_{3} | 1.31 | 0.36 | 1.00 |
The ranking of av(i) based on RRW is equal to that returned by OSC, and even the normalized score values are quite similar. Yet, while in the case of nominal variables assignates are usually not collapsed, this will always be done in the case of binary variables.
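The figures of Table 3 can be approximated from the frequencies of Table 1 as sketched below. The raw ratio follows the remark that the RRW uses the row marginals; the damping by log10(1 + x) is an assumption on our side, chosen because it reproduces the tabulated values to two decimals. The variable names are hypothetical.

```python
import math

# Frequencies from Table 1
ta = {"av1": 140, "av2": 120, "av3": 160}   # remaining outcomes
tf = {"av1": 30, "av2": 10, "av3": 40}      # focused outcome

# Row marginal sums: 420 and 80
n_ta, n_tf = sum(ta.values()), sum(tf.values())

# Raw RRW: relative frequency within tf divided by relative frequency within ta
raw = {k: (tf[k] / n_tf) / (ta[k] / n_ta) for k in ta}

# Assumed damping, inferred from Table 3
rrw = {k: math.log10(1.0 + r) for k, r in raw.items()}

# Linear normalization into [0, 1]
lo, hi = min(rrw.values()), max(rrw.values())
rrw_norm = {k: (v - lo) / (hi - lo) for k, v in rrw.items()}
```

For sparse binary tables, the weights obtained in this way could then be summed up per record in order to collapse a whole set of binary assignates into a single numerical one.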
So, let us summarize these simple methods in the following table.
Table 4: Overview of some of the most important transformations for tabular data.

| Transformation | Mechanism | Effect, New Value | Properties, Conditions |
|---|---|---|---|
| log-transform | analytic function | | |
| analytic combination | explicit analytic function (a,b)→f(a,b) | enhancing signal-to-noise ratio for the relationship between predictors and predicted, 1 new variable | targeted modeling |
| empiric combinational recoding | using simple clustering methods like KNN or K-means for a small number of assignates | distance from cluster centers and/or cluster center as new variables | targeted modeling |
| Deciling | upon evaluation of properties of the distribution | 2 new variables | |
| Collapsing | based on extreme-value quantiles | 1 new variable, better distinction for data in frequent bins | |
| optimal scaling | numerical encoding and/or rescaling using marginal sums | enhancing the scaling of the assignate from nominal to numerical | targeted modeling |
| relative risk weight | ditto | collapsing sets of sparsely filled variables | targeted modeling |
Obviously, the transformation of data is not an analytical act, on both sides. On the left-hand side it refers to structural and hence semantic assumptions, while on the right-hand side it introduces hypotheses about those assumptions. Numbers are never ever just values, much like sentences and words do not consist just of letters. After all, the difference between both is probably less than one could initially presume. Later we will address this aspect from the opposite direction, when it comes to the translation of textual entities into numbers.
Time Series and Contexts
Time series data are the most valuable data. They allow the reconstruction of the flow of information in the observed system, either between variables intrinsic to the measurement setup (reflecting the “system”) or between treatment and effects. In recent years, the so-called “causal FFT” has gained some popularity.
Yet, modeling time series data poses the same problems as tabular data. We do not know a priori which variables to include, or how to transform variables in order to reflect particular parts of the information in the most suitable way. Simply pressing an FFT onto the data is nothing but naive. FFT assumes a harmonic oscillation, or a combination thereof, which certainly is not appropriate. Even if we interpret a long series of FFT terms as an approximation to an unknown function, it is by no means clear whether the then assumed stationarity26 is indeed present in the data.
Instead, it is more appropriate to represent the aspects of a time series in multiple ways. Often, there are many time series available, one for each assignate. This brings the additional problem of careful evaluation of cross-correlations and autocorrelations, and all of this under the condition that it is not known a priori whether the evolution of the system is stationary.
Fortunately, the analysis of multiple time series, even from non-stationary processes, is quite simple if we follow the approach as outlined so far. Let us assume a set of assignates {a(i)} for which time series measurements are available, given by equidistant measurement points. A transformation is then constructed by a method m that is applied to a moving window of size md(k). All moving windows of any size are adjusted such that their endpoints meet at the measurement point at time t(m(k)). Let us call this point the prediction base point, T(p). The transformed values consist either of the residuals between the values produced by the method and the measurement data, or of the parameters of the method fitted to the moving window. An example for the latter case is given by wavelet coefficients, which provide a quite suitable, multi-frequency perspective onto the development up to T(p). Of course, the time series data of different assignates could be related to each other by any arbitrary functional mapping.
The target value for the model could be any set of future points relative to t(m(k)). The model may predict a singular point, averages some time in the future, the volatility of the future development of the time series, or even the parameters of a particular mapping function relating several assignates. In the latter case the model would predict several criteria at once.
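The construction of such context records may be sketched as follows. The sketch uses the simplest conceivable method m, the mean over the window, together with its residual, and a single future point as target; the function name `context_records`, the window sizes and the horizon are assumptions made for illustration only.

```python
from statistics import mean
from typing import Dict, List

def context_records(series: List[float], windows: List[int],
                    horizon: int) -> List[Dict[str, float]]:
    """Create one record per prediction base point T(p).

    All moving windows end at T(p); the target is the value
    `horizon` steps after T(p).
    """
    records = []
    longest = max(windows)
    for p in range(longest - 1, len(series) - horizon):
        record = {"value": series[p]}
        for w in windows:
            window = series[p - w + 1 : p + 1]       # w points, ending at T(p)
            record[f"mean_{w}"] = mean(window)       # parameter of the method
            record[f"residual_{w}"] = series[p] - record[f"mean_{w}"]
        record["target"] = series[p + horizon]       # future point to predict
        records.append(record)
    return records

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]    # hypothetical measurements
records = context_records(series, windows=[2, 4], horizon=2)
```

Each record is a row of an ordinary table, in which the recent history of the series appears as a set of additional variables.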
Such transformations yield a table that contains a lot more variables than originally available. The ratio may grow up to 1:100 in complex cases like the global financial markets. Just to be clear: if you measure, say, the index values of 5 stock markets, some commodities like gold, copper, precious metals and “electronics metals”, the money market, bonds and some fundamentals, that is approx. 30 basic input variables, even a superficial analysis would have to inspect 3000 variables… Yes, learning and gaining experience can take quite a bit! Learning and experience do not become cheaper just because we use machines to achieve them. Only exploring is easier nowadays, not requiring lifetimes any more. The reward consists of stable models about complex issues.
Each point in time is reflected by the original observational values and a lot of variables that express the most recent history relative to the point in time represented by the respective record. Any of the synthetic records thus may be interpreted as a hypothesis about the future development, where this hypothesis comes as a multidimensional description of the context up to T(p). It is then the task of the evolutionarily optimized variable selection based on the SOM to select the most appropriate hypothesis. Any subgroup contained in the SOM then represents comparable sets of relations between the past relative to T(p) and the respective future as it is operationalized into the target variable.
Typical transformations in such associative time series modeling are
 – moving average and exponentially decaying moving average for deseasoning or detrending;
 – various correlational methods: cross- and autocorrelation, including the result parameters of the Bartlett test;
 – Wavelet, FFT, or Walsh transforms of different order, residuals to the denoised reconstruction;
 – fractal coefficients like Lyapunov coefficient or Hausdorff dimension;
 – ratios of simple regressions calculated over moving windows of different size;
 – domain specific markers (think of technical stock market analysis, or ECG).
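Two of the simpler entries of this list, the exponentially decaying moving average for detrending and the autocorrelation at a given lag, can be sketched in plain Python. The function names and the smoothing factor `alpha` are hypothetical; the autocorrelation uses the common estimate that divides the lagged covariance by the overall variance.

```python
from typing import List

def exp_moving_average(series: List[float], alpha: float) -> List[float]:
    """Exponentially decaying moving average; alpha in (0, 1] is the
    weight of the most recent value."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed

def detrend(series: List[float], alpha: float) -> List[float]:
    """Residuals of the series relative to its smoothed version."""
    return [x - s for x, s in zip(series, exp_moving_average(series, alpha))]

def autocorrelation(series: List[float], lag: int) -> float:
    """Autocorrelation of the series with itself shifted by `lag` steps."""
    n = len(series)
    m = sum(series) / n
    variance = sum((x - m) ** 2 for x in series)
    covariance = sum((series[i] - m) * (series[i + lag] - m)
                     for i in range(n - lag))
    return covariance / variance

periodic = [0.0, 1.0, 0.0, -1.0] * 5          # hypothetical series with period 4
ac_full_period = autocorrelation(periodic, 4)  # strongly positive
ac_half_period = autocorrelation(periodic, 2)  # strongly negative
residuals = detrend(periodic, alpha=0.3)
```

In the transformation tool each of these functions would be applied to moving windows ending at T(p), so that their results become additional columns of the context table.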
Once we have expressed a collection of time series as a series of contexts preceding the prediction point T(p), the further modeling procedure does not differ from the modeling of ordinary tabular data, where the observations are independent from each other. From the perspective of our transformation tool, these time series transformations are nothing else than “methods”; they do not differ from other plugin methods with respect to the procedure calls in their programming interface.
„Unstructurable“ „Data“: Images and Texts
The last type of data for which we briefly would like to discuss the issue of transformation is “unstructurable” data. Images and texts are the main representatives for this class of entities. Why are these data “unstructurable”?
Let us answer this question from the perspective of textual analysis. Here, the reason is obvious; actually, there are several obvious reasons. Patrizia Violi [17] for instance emphasizes that words create their own context, upon which they are then going to be interpreted. Douglas Hofstadter extended the problem to thinking at large, arguing that for any instance of analogical thinking (and he claimed any thinking to be analogical) it is impossible to define criteria that would allow us to set up a table. Here on this site we argued repeatedly that it is not possible to define any criteria a priori that would capture the “meaning” of a text.
Moreover, understanding language, as well as understanding texts, can’t be mapped to the problem of predicting a time series. In language, there is no such thing as a prediction point T(p), and there is no positively definable “target” which could be predicted. The main reason for this is the special dynamics between context (background) and proposition (figure). It is a multi-level, multi-scale thing. It is ridiculous to apply n-grams to text and then hope to catch anything “meaningful”. The same is true for any statistical measure.
Nevertheless, using language, that is, producing and understanding is based on processes that select and compose. In some way there must be some kind of modeling. We already proposed a structure, or more, an architecture, for this in a previous essay.
The basic trick consists of two moves: Firstly, texts are represented probabilistically as random contexts in an associative storage like the SOM. No variable selection takes place here, no modeling and no operationalization of a purpose is present. Secondly, this representation is then used as a basis for targeted modeling. Yet, the “content” of this representation does not consist of “language” data anymore. Strikingly different, it contains data about the relative location of language concepts and their sequence as they occur as random contexts in a text.
The basic task in understanding language is to accomplish the progress from a probabilistic representation to a symbolic tabular representation. Note that any tabular representation of an observation is already on the symbolic level. In the case of language understanding precisely this is not possible: We can’t define meaning, and above all, not a priori. Meaning appears as a consequence of performance and execution of certain rules to a certain degree. Hence we can’t provide a priori the symbols that would be necessary to set up a table for modeling, assessing “similarity” etc.
Now, instead of probabilistic non-structured representation we could also say arbitrary unstable structure. From this we should derive a structured, (proto)symbolic and hence tabular and almost stable structure. The trick to accomplish this consists of using the modeling system itself as measurement device and thus also as a “root” for further reference in the then possible models. Kohonen and colleagues demonstrated this crucial step in their WebSom project. Unfortunately (for them), they then actualized several misunderstandings regarding modeling. For instance, they misinterpreted associative storage as a kind of model.
The nice thing with this architecture is that once the symbolic level has been achieved, any of the steps of our modeling approach can be applied without any change, including the automated transformation of “data” as described above.
Understanding the meaning of images follows the same scheme. The fact that there are no words renders the task more complicated and more simple at the same time. Note that so far there is no system that would have learned to “see”, to recognize and to understand images, despite many titles claiming that the proposed “system” can do so. All computer vision approaches are analytic by nature, hence they are all deeply inadequate. The community is running straight into the method hell as the statisticians and the data miners did before, mistaking transformations for methods, conflating transformation and modeling, etc. We discussed these issues at length above. Any of the approaches might be intelligently designed, but all are victimized by the representationalist fallacy, and probably even by naive realism. Due to the fact that the analytic approach is first, second and third mainstream, the probabilistic and contextual bottom-up approach is missing so far. In the same way as a word is not equal to the grapheme, a line is not defined on the symbolic level in the brain. Here we again meet the problem of analogical thinking, even on the most primitive graphical level. When is a line still a line, when is a triangle still a triangle?
In order to start in the right way we first have to represent the physical properties of the image along different dimensions, such as textures, edges, or salient points, and all of those across different scales. Probably one can even detect salient objects by some analytic procedure. From any of the derived representations the random contexts are derived and arranged as vectors. A single image is represented as a table that contains random contexts derived from the image as a physical entity. From here on, the further processing scheme is the same as for texts. Note, that there is no such property as “line” in this basic mapping.
In the case of texts and images the basic transformation steps thus consist in creating the representation as random contexts. Fortunately, this is “only” a question of suitable plugins for our transformation tool. In both cases, for texts as well as images, the resulting vectors could grow considerably. Several thousands of implied variables must be expected. Again, there is already a solution, known as random projection, which allows us to compress even very large vectors (say 20’000+) into one of, say, at most 150 variables, without losing much of the information that is needed to retain the distinctive potential. Random projection works by multiplying a vector of size N with a matrix of uniformly distributed random values of size NxM, which results in a vector of size M. Of course, M is chosen suitably (100+). The reason why this works is that with that many dimensions all vectors are approximately orthogonal to each other! Of course, the resulting fields in such a vector do not “represent” anything that could be conceived as a reference to an “object”. Internally, however, that is from the perspective of a (population of) SOMs, it may well be used as an (almost) fixed “attribute”. Yet, neither the missing direct reference nor the subjectivity poses a problem, as the meaning is not a mental entity anyway. Q.E.D.
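Random projection as described here can be tried out with the standard library alone. The sketch below uses hypothetical sizes (N = 1000, M = 150), a matrix with values drawn uniformly from [-1, 1] and a fixed seed; the zero-centered range of the random values is our assumption. It then compares the cosine similarity of projected vectors in order to show that similar vectors remain similar while unrelated vectors remain nearly orthogonal.

```python
import math
import random
from typing import List

def projection_matrix(n: int, m: int, seed: int = 0) -> List[List[float]]:
    """N x M matrix of uniformly distributed random values in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(n)]

def project(vector: List[float], matrix: List[List[float]]) -> List[float]:
    """Multiply a vector of size N with the N x M matrix, giving size M."""
    m = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(m)]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

N, M = 1000, 150                                     # hypothetical sizes
rng = random.Random(42)
a = [rng.uniform(-1.0, 1.0) for _ in range(N)]       # some context vector
b = [x + rng.uniform(-0.05, 0.05) for x in a]        # slightly perturbed copy
c = [rng.uniform(-1.0, 1.0) for _ in range(N)]       # unrelated vector

P = projection_matrix(N, M)
pa, pb, pc = project(a, P), project(b, P), project(c, P)

similar = cosine(pa, pb)      # stays close to 1
unrelated = cosine(pa, pc)    # stays close to 0
```

The same matrix has to be used for all vectors that are to be compared later, which is why the seed is kept fixed in the sketch.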
Conclusion
Here in this essay we discussed several aspects related to the transformation of data as an epistemic activity. We emphasized that an appropriate attitude towards the transformation of data requires a shift in perspective and the choice of another vantage point. One of the more significant changes in attitude concerns, perhaps, the dropping of any positivist approach as one of the main pillars of traditional modeling. Remember that statistics is such a positivist approach. In our perspective, statistical methods are just transformations, nothing less, but above all also nothing more, characterized by a specific set of rather strong assumptions and conditions for their applicability.
We also provided some important practical examples for the transformation of data, whether tabular data derived from independent observations, time series data or “unstructurable” “data” like texts and images. According to the proposed approach we also described a prototypical architecture for a transformation tool that could be used universally. In particular, it allows a complete automation of the modeling task, as it could be used for instance in the field of so-called data mining. The possibility for automated modeling is, of course, a fundamental requirement for any machine-based episteme.
1. The only reason why we do not refer to cultures and philosophies outside Europe is that we do not know sufficient details about them. Yet, I am pretty sure that taking into account Chinese or Indian philosophy would complicate the situation even further.
2. It was Friedrich Schleiermacher who first observed that even the text becomes alien and at least partially autonomous to its author due to the necessity and inevitability of interpretation. Thereby he founded hermeneutics.
3. In German language these words all exhibit a multiple meaning.
4. In the last 10 years (roughly) it became clear that the gene-centered paradigms are not only not sufficient [2], they are even seriously defective. Evelyn Fox Keller draws a detailed trace of this weird paradigm [3].
5. Michel Foucault [4]
6. The „axiom of choice“ is one of the founding axioms in mathematics. Its importance can hardly be overestimated. Basically, it assumes that “something is choosable”. The notion of “something choosable” then is used to construct countability as a derived domain. This implies three consequences. First, it avoids assuming countability, that is, the effect of a preceding symbolification, as a basis for set theory. Secondly, it puts performance in first place. These two implications render the “Axiom of Choice” into a doubly-articulated rule, offering two docking sites, one for mathematics, and one for philosophy. In some way, it thus can not count as an “axiom”. Those implications are, for instance, fully compatible with Wittgenstein’s philosophy. For these reasons, Zermelo’s “axiom” may even serve as a shared point (of departure) for a theory of machine-based episteme. Finally, the third implication is that through the performance of the selection the relation, notably a somewhat empty relation, is conceived as a predecessor of countability and the symbolic level. Interestingly, this also relates to Quantum Darwinism and String Theory.
7. David Grahame Shane’s theory on cities and urban entities [5] is probably the only theory in urbanism that is truly a relational theory. Additionally, his work is full of relational techniques and concepts, such as the “heterotopy” (a term coined by Foucault).
8. Bruno Latour developed the Actor-Network Theory [6,7], while Clarke evolved “Grounded Theory” into the concept of “Situational Analysis” [8]. Latour, as well as Clarke, emphasizes and focuses on the relation as a significant entity.
9. behavioral coating, and behavioral surfaces ;
10. See Information & Causality about the relation between measurement, information and causality.
11. „Passivist“ refers to the inadequate form of realism according to which things exist as such, independently from interpretation. Of course, interpretation does affect the material dimension of a thing. Yet, it changes its relations insofar as the relations of a thing, the Wittgensteinian “facts”, are visible and effective only if we actively assign significance to them. The “passivist” stance conceives itself as a reconstruction instead of a construction (cf. Searle [9]).
12. In [10] we developed an image theory in the context of the discussion about the mediality of facades of buildings.
13. nonsense of „nonsupervised clustering“
14. In his otherwise quite readable book [11], though it may serve only as an introduction.
15. This can be accomplished by using a data segment for which the implied risk equals 0 (positive predictive value = 1). We described this issue in the preceding chapter.
16. hint to particle physics…
17. See our previous essay about the complementarity of the concepts of causality and information.
18. For an introduction to renormalization (in physics) see [12], and, a bit more technical, [13].
19. See the Wiki entry about so-called gravitational lenses.
20. Catastrophe theory is a concept invented and developed by French mathematician Rene Thom as a field of Differential Topology. cf. [14]
21. In their book, Witten & Frank [15] recognized the importance of transformation and included a dedicated chapter about it. They also explicitly mention the creation of synthetic variables. Yet, they also explicitly retreat from it as a practical means for reasons of computational complexity (here: the time needed to perform a calculation in relation to the amount of data). After all, their attitude towards transformation is somehow that towards an unavoidable evil. They do not recognize its full potential. As a cure for the selection problem, they propose SVM and their hyperplanes, which is definitely a poor recommendation.
22. Dorian Pyle [11]
23. see Benoit Mandelbrot [16].
24. Using almost meaningless labels, target-oriented modeling is often called supervised modeling, as opposed to “non-supervised modeling”, where no target variable is being used. Yet, such modeling does not result in a model, since the pragmatics of the concept of “model” invariably requires a purpose.
25. About assignates: often called property, or feature… see about modeling
26. Stationarity is a concept in empirical system analysis or description, which denotes the expectation that the internal setup of the observed process will not change across time within the observed period. If a process is rated as “stationary” upon a dedicated test, one could select a particular, and only one particular, method or model to reflect the data. Of course, we again meet the chicken-egg problem. We can decide about stationarity only by means of a completed model, that is, after the analysis. As a consequence, we should not use linear methods, or methods that depend on independence, for checking the stationarity before applying the “actual” method. Such a procedure can not count as a methodology at all. The modeling approach should be stable against non-stationarity. Yet, the problem of the reliability of the available data sample remains, of course. As a means to “robustify” the resulting model against the unknown future one can apply surrogating. Ultimately, however, the only cure is a circular, or recurrent, methodology that incorporates learning and adaptation as a structure, not as a result.
 [1] Robert Rosen, Life Itself: A Comprehensive Inquiry into the Nature, Origin, and Fabrication of Life. Columbia University Press, New York 1991.
 [2] Nature Insight: Epigenetics, Supplement Vol. 447 (2007), No. 7143, pp. 396-440.
 [3] Evelyn Fox Keller, The Century of the Gene. Harvard University Press, Boston 2002. See also: E. Fox Keller, “Is There an Organism in This Text?”, in: P. R. Sloan (ed.), Controlling Our Destinies. Historical, Philosophical, Ethical, and Theological Perspectives on the Human Genome Project. University of Notre Dame Press, Notre Dame (Indiana) 2000, pp. 288-289.
 [4] Michel Foucault, Archeology of Knowledge. 1969.
 [5] David Grahame Shane. Recombinant Urbanism: Conceptual Modeling in Architecture, Urban Design and City Theory
 [6] Bruno Latour. Reassembling The Social. Oxford University Press, Oxford 2005.
 [7] Bruno Latour (1996). On Actor-Network Theory. A few Clarifications. in: Soziale Welt 47, Heft 4, p.369-382.
 [8] Adele E. Clarke, Situational Analysis: Grounded Theory after the Postmodern Turn. Sage, Thousand Oaks, CA 2005.
 [9] John R. Searle, The Construction of Social Reality. Free Press, New York 1995.
 [10] Klaus Wassermann & Vera Bühlmann, Streaming Spaces – A short expedition into the space of media-active façades. in: Christoph Kronhagel (ed.), Mediatecture, Springer, Wien 2010. pp.334-345.
 [11] Dorian Pyle, Data Preparation for Data Mining. Morgan Kaufmann, San Francisco 1999.
 [12] John Baez (2009). Renormalization Made Easy. Webpage
 [13] Bertrand Delamotte (2004). A hint of renormalization. Am.J.Phys. 72: 170-184. available online.
 [14] Tim Poston & Ian Stewart, Catastrophe Theory and Its Applications. Dover Publ. 1997.
 [15] Ian H. Witten & Eibe Frank, Data Mining. Practical Machine Learning Tools and Techniques (2nd ed.). Elsevier, Oxford 2005.
 [16] Benoit Mandelbrot & Richard L. Hudson, The (Mis)behavior of Markets. Basic Books, New York 2004.
 [17] Patrizia Violi (2000). Prototypicality, typicality, and context. in: Liliana Albertazzi (ed.), Meaning and Cognition – A multidisciplinary approach. Benjamins Publ., Amsterdam 2000. p.103-122.
۞
Vagueness: The Structure of Non-Existence.
December 29, 2011
For many centuries now, clarity has been the major goal of philosophy.
It drove the first instantiation of logics by Aristotle, who devised it as a cure for mysticism, which was considered as a kind of primary chaos in human thinking. Clarity has been the intended goal in the second enlightenment as a cure for scholastic worries, and among many other places we find it in Wittgenstein’s first work, now directed to philosophy itself. In any of those instances, logics served as a main pillar to follow the goal of clarity.
Vagueness seems to act as an opponent to this intention, lurking behind the scenes in any comparison, which is why one may regard it as ubiquitous in cognition. There are whole philosophical and linguistic schools dealing with vagueness as their favorite subject. Heather Burnett (UCLA) recently provided a rather comprehensive overview [1] of the various approaches, including her own proposals to solve some puzzles of vagueness in language, particularly related to relative and absolute adjectives and their context-dependency. In the domain of scientific linguistics, vagueness is characterized by three related properties: being fuzzy, being borderline, or being susceptible to the sorites (heap) paradox. A lot of rather different proposals for a solution have been suggested so far [1,2], most of them technically quite demanding; yet, none has been generally accepted as a convincing one.
The mere fact that there are many incommensurable theories, models and attitudes about vagueness we take as a clear indication for a still unrecognized framing problem. Actually, in the end we will see that the problem of vagueness in language does not “exist” at all. We will provide a sound solution that does not refer just to the methodological level. If we replace vagueness by the more appropriate term of indeterminacy we readily recognize that we can’t speak about vague and indeterminate things without implicitly talking about logics. In other words, the issue of (nonlinguistic) vagueness triggers the question about the relation between logics and world. This topic we will investigate elsewhere.
Regarding vagueness, let us consider just two examples. The first one is Peter Unger’s famous example regarding clouds [3]. Where does a cloud end? This question can’t be answered. Close inspection and accurate measurement do not help. It seems as if the vagueness is a property of the “phenomenon” that we call “cloud.” If we conceive it as a particular type of object, we may attribute to it a resemblance to what is called an “open set” in mathematical topology, or to the integral of asymptotic functions. Bertrand Russell, however, would have called this the fallacy of verbalism [4, p.85].
Vagueness and precision alike are characteristics which can only belong to a representation, of which language is an example. […] Apart from representation, whether cognitive or mechanical, there can be no such thing as vagueness or precision;
For Russell, objects can’t be ascribed properties, e.g. vague. Vague is a property of the representation, not of the object. Thus, when Unger concludes that there are no ordinary things, he gets trapped even by several misunderstandings, as we will see. We could add that open sets, i.e. sets without definable border, are not vague at all.
As the second example we take a widespread habit in linguistics when addressing the problem of vagueness, e.g. supervaluationism. This system has the consequence that borderline cases of vague terms yield statements that are neither true nor false. Although that model induces a truth-value gap, it nevertheless keeps the idea of truth values fully intact. All linguistic models about vagueness assume that it is appropriate to apply the idea of truth values, predicates and predicate logics to language.
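To make the mechanics of supervaluationism tangible, here is a minimal sketch in Python. It is an illustration only: the cut-off values for “bald” and all names are hypothetical and taken from no linguistic source. A statement counts as “super-true” if it is true under every admissible way of making the vague term precise, as “super-false” if it is false under every one, and it falls into the gap otherwise.

```python
# Hypothetical "precisifications" of the vague term "bald":
# each value is one admissible cut-off (bald = at most this many hairs).
ADMISSIBLE_CUTOFFS = [1000, 5000, 20000]


def supervaluate(statement, precisifications):
    """Evaluate a statement under every precisification.

    `statement` is a function mapping one precisification to True or False;
    `precisifications` is assumed to be non-empty.
    """
    verdicts = {bool(statement(p)) for p in precisifications}
    if verdicts == {True}:
        return "super-true"
    if verdicts == {False}:
        return "super-false"
    return "gap"  # neither true nor false: a borderline case


def bald(hairs):
    """The statement 'a man with this many hairs is bald'."""
    return lambda cutoff: hairs <= cutoff


def bald_or_not_bald(hairs):
    """The statement 'he is bald or he is not bald' (excluded middle)."""
    return lambda cutoff: bald(hairs)(cutoff) or not bald(hairs)(cutoff)


if __name__ == "__main__":
    for hairs in (500, 3000, 100000):
        print(hairs,
              supervaluate(bald(hairs), ADMISSIBLE_CUTOFFS),
              supervaluate(bald_or_not_bald(hairs), ADMISSIBLE_CUTOFFS))
```

The borderline man with 3000 hairs ends up in the gap, yet “bald or not bald” remains super-true for him as well. The sketch thus shows the point made above: the model introduces a gap, but the whole machinery of truth values and predicates stays in place.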
As far as I can tell from all the sources I have inspected, any approach to vagueness in linguistics takes place within two very strong assumptions. The first basic assumption is that (1) the concept of “predicates” can be applied to an analysis of language. From that basic assumption, three further, secondary assumptions derive. (1.1) Language is a means to transfer clear statements. (1.2) It is possible to use language in a way that no vagueness appears. (1.3) Words are items that can be used to build predicates.
Besides this first assumption of the “predicativity” of language, linguistics further assumes that words could be definite and non-ambiguous. Yet, that is not a basic assumption. The second basic assumption behind it is that (2) the purpose of language is to transfer meaning unambiguously. Yet, all three aspects of that assumption are questionable: being a purpose, serving as a tool or even a medium to transfer meaning, and doing so unambiguously.
So we summarize: Linguistics employs two strong assumptions:
 (1) The concept of a priori determinable “predicates” can be applied to an analysis of language.
 (2) The purpose of language is to transfer meaning unambiguously.
Our position is that both assumptions are deeply inappropriate. The second one we already dealt with elsewhere, so we focus on the first one here. We will see that the “problematics of vagueness” is nonexistent. We do not claim that there is no vagueness, but we deny that it is a problem. There are also no serious threats from linguistic paradoxes, because these paradoxes are simply a consequence of “silly” behavior.
We will provide several examples to that, but the structure of it is the following. The problematics consists of a performative contradiction to the rules one has set up before. One should not pretend to play a particular game by fixing the rules upon one’s own interests, only to violate those rules a moment later. Of course, one could create a play / game from this, too. Lewis Carroll wrote two books about the bizarre consequences of such a setting. Let us again listen to Russell’s arguments, now to his objection against the “paradoxicity” of “baldness,” which is usually subsumed to the sorites (heap) paradox.
It is supposed that at first he was not bald, that he lost his hairs one by one, and that in the end he was bald; therefore, it is argued, there must have been one hair the loss of which converted him into a bald man. This, of course, is absurd. Baldness is a vague conception; some men are certainly bald, some are certainly not bald, while between them there are men of whom it is not true to say they must either be bald or not bald. The law of excluded middle is true when precise symbols are employed, but it is not true when symbols are vague, as, in fact, all symbols are.
Now, describing the heap (Greek: sorites) or the hair of “balding” men by referring to countable parts of the whole, i.e. either sand particles or singularized hairs, contradicts the conception of baldness. Confronting both in a direct manner (removing hair by hair) mixes two different games. Mixing soccer and tennis is “silly,” especially after the participants have declared that they intend to play soccer. Mixing vagueness and counting is silly, too, for the same reason.
This should make clear why the application of the concept of “predicates” to vague concepts, i.e. concepts that are defined a priori as vague, is simply absurd. Remember, even a philosopher as highly innovative as Russell, co-author of a work as extremely abstract as the Principia Mathematica, needed several years to accept Wittgenstein’s analysis that the usage of symbols in the Principia is self-contradictory, because actualized symbols are never free of semantics.
Words are Non-Analytic Entities
First, I would like to recall an observation first, or at least most popularly, expressed by Augustine. His concern was the notion of time. I’ll give a sketch of it in my own words. As long as he simply uses the word, he knows perfectly well what time is. Yet, as soon as he starts to think about time, trying to get an analytic grip on it, he increasingly loses touch and understanding, until he does not know anything about it at all.
This phenomenon is not limited to the analysis of a concept like time, which some conceive even as a transcendental “something.” The phenomenon of disappearance upon close inspection is not unknown. We meet it in Carroll’s character of the Cheshire cat, and we meet it in quantum physics. Let us call this phenomenon the CQ-phenomenon.
Ultimately, the CQ-phenomenon is a consequence of the self-referentiality of language and the self-referentiality of the investigation of language. It is not possible to apply a scale to itself without getting into serious trouble like fundamental paradoxicity. The language game of “scale” implies a separation of observer and observed that can’t be maintained in the cases of the cat, the quantum, or language. Of course, there are ways to avoid such difficulties, but only at high cost. For instance, strong regulations or very strict conventions can be imposed on the investigation of such areas and on the application of self-referential scales, among which one may count linguistics, sociology, cognitive sciences, and of course quantum physics. Actually, positivism is nothing else than such a “strong convention”. Yet, even with such strong conventions being applied, the results of such investigations are surprising and arbitrary, far from being a consequence of rationalist research, because self-referential systems are always immanently creative.
It is more than salient that linguists create models about vagueness that are subsumed to language. This position is deeply nonsensical and does not only purport ontological relevance for language, it implicitly also claims a certain “immediacy” for the linkage between language and empirical aspects of the world.
Our position is strongly different from that: models are entities that are “completely” outside of language. Of course, they are not separable from each other. We will deal elsewhere with this mutual dependency in more details and a more appropriate framing. Regardless how modeling and language are related, they definitely can not be related in the way linguistics implicitly assumes. It is impossible to use language to transfer meaning, because it is in principle not possible to transfer meaning at all. Of course, this opens the question what then is going to be “transferred.”
This brings us to the next objection against the presumed predicativity of language, namely its role in social intercourse, from which the CQ-phenomenon can’t be completely separated.
Language: What can be Said
Many things and thoughts are not explicable. Many things also can be just demonstrated, but not expressed in any kind of language. Yet, despite these two severe constraints, we may use language not only to explicitly speak about such things, but also to create what only can be demonstrated.
Robert Brandom’s work [5] may well be regarded as a further leap forward in the understanding of language and its practice. He proposes the inferentialist position, with which our positioning of the model is completely compatible. According to Brandom, we always have to infer a lot of things from received words during a discourse. We even have to signal that we expect those things to be inferred. The only thing that we can try in a language-based interaction is to increasingly confine the degrees of freedom of the possible models that are created in the interactees’ minds. Yet, achieving a certain state of resonance, or feeling that one understands each other, does NOT imply that the models are identical. All that could be said is that the resonating models in the two interacting minds allow a certain successful prediction of the further course of the interaction. Here, we should be very clear about our understanding of the concept of model. You will find it in the chapters about the generalized model and the formal status of models (as a category).
Since Austin [6] it is well-known that language is not equal to the series of graphical or phonic signals. The reason for this is simply that language is a social activity, structural as well as performative. An illocutionary act is part of any utterance and any piece of text in a natural language, sometimes even in the case of a formal language. Yet, it is impossible to speak about that dimension in language.
A text is even more than a “series” of Austinian or Searlean speech acts. The reason for this is a certain aspect of embodiment: Only entities stuffed with memory can use language. Now, receiving a series of words immediately establishes a more or less volatile immaterial network in the “mind” of the receiving entity as well as in the “sending” entity. This network has properties about which it is absolutely impossible to speak, despite the fact that these networks represent somehow the ultimate purpose, or “essence”, of natural language. We can’t speak about that, we can’t explicate it, and we simply commit a categorical mistake if we apply logics and tools from logics like predicates in the attempt to understand it.
Logics and Language
These phenomena clearly prove that logics and language are different things. They are deeply incommensurable, despite the fact that they can’t be separated completely from each other, much like modeling and language. The structure of the world shows up in the structure of logics, as Wittgenstein mentioned. There are good reasons to take Wittgenstein seriously on that. According to the Tractatus, the coupling between world and logics can’t be a direct one [7].
In contrast to the world, logics is not productive. “Novelty” is not a logical entity. Pure logics is a transcendental system about usage of symbols, precisely because any usage already would require interpretation. Logical predicates are nothing that need to be interpreted. These games are simply different games.
In his talk to the Jowett Society, Oxford, in 1923, Bertrand Russell, exhibiting an attitude quite different to that in the Principia and following much the line drawn by Wittgenstein, writes [4, p.88]:
Words such as “or” and “not” might seem, at first sight, to have a perfectly precise meaning: “p or q” is true when p is true, true when q is true, and false when both are false. But the trouble is that this involves the notions of “true” and “false”; and it will be found, I think, that all the concepts of logic involve these notions, directly or indirectly. Now “true” and “false” can only have a precise meaning when the symbols employed – words, perceptions, images, or what not – are themselves precise. We have seen that, in practice, this is not the case. It follows that every proposition that can be framed in practice has a certain degree of vagueness; that is to say, there is not one definite fact necessary and sufficient for its truth, but a certain region of possible facts, any one of which would make it true. And this region is itself ill-defined: we cannot assign to it a definite boundary.
This is exactly what we meant before: “Precision” concerning logical propositions is not achievable as soon as we refer to symbols that we use. Only symbols that can’t be used are precise. There is only one sort of such symbols: transcendental symbols.
Mapping logics to language, as it happens so frequently and probably even as an acknowledged practice in linguistics in the treatment of vagueness, means to reduce language to logics. One changes the frame of reference, much like Zeno does in his self-generated pseudo-problems, much like Cantor1 [8] and his fellow Banach2 [9] did (in contrast to Dedekind3 [10]), or what Taylor4 did [11]. 3-dimensionality produces paradoxes in a 2-dimensional world, not only faulty projections. It is not really surprising that awkward paradoxes appear through the positivistic reduction of language to logics. Positivism implies violence, not only in the case of linguistics.
We now can understand why it is almost silly to apply a truth-value methodology to the analysis of language. The problem of vagueness is not a problem; it is deeply embedded in the blueprint of “language” itself. It is almost trivial to make remarks as Russell did [4, p.87]:
The fact is that all words are attributable without doubt over a certain area, but become questionable within a penumbra, outside which they are again certainly not attributable.
And it really should be superfluous to cite this 90-year-old piece. Quite remarkably, it is not.
Language as a Practice
Wittgenstein emphasized repeatedly that language is a practice. Language is not a structure, so it is neither equivalent to logics nor to grammar, or even grammatology. In practices we need models for prediction or diagnosis, and we need rules, we frequently apply habits, which even may get symbolized.
Thus, we again may ask what is happening when we talk to each other. First, we exclude those models of which we now understand that they are not appropriate.
 – Logics is incommensurable with language.
 – Language, as well as any of its constituents, can’t be made “precise.”
As a consequence, language (and all of its constituents) is something that can’t be completely explicated. Large parts of language can only be demonstrated. Of course, we do not deny the proposal that a discourse reflects “propositional content,” as Brandom calls it ([5] chp. 8.6.2.). This propositional or conceptual content is given by the various kinds of models appearing in a discourse, models that are being built, inferred, refined, symbolized and finally externalized. As soon as we externalize a model, however, it is not language any more. We will investigate the dynamical route between concepts, logics and models in another chapter. Here and for the time being we may state that applying logics as a tool to language mistakes propositional content for propositional structure.
Again: What happens if I point to the white area up in the air before the blue background that we call sky, exclaiming “Oh, look, a cloud!”? Do I mean that there is an object called “cloud”? Even an object at all? No, definitely not. Claiming that there are “cloud-constituters,” that we do not measure exactly enough, that there is no proper thing we could call “cloud” (Unger), that our language has a defect etc.: any such purported “solution” of the problem [for an overview see 11] does not help in the slightest.
Anybody who has made a mountain hike knows the fog at high altitudes. From lower regions, however, the same actual phenomenon is seen as a cloud. This provides us with a hint that the language game “cloud” also comprises information about the physical relational properties (position, speed, altitude) of the speaker.
What is going to happen by this utterance is that I invite my partner in discourse to interpret a particular, presumably shared sensory input and to interpret me and my interpretations as well. We may infer that the language game “cloud” contains a marker that is both linked to the structure and the semantics of the word, indicating that (1) there is an “object” without sharp borders, (2) no precise measurement should be performed. The symbolic value of “cloud” is such that there is no space for a different interpretation. Not the “object” is indicated by the word “cloud,” but a particular procedure, or class of procedures, that I as the primary speaker suggest when saying “Oh, there is a cloud.” By means of such procedures a particular style of modeling will be “induced” in my discourse partner, a particular way to actualize an operationalization, leading to such a representation of the signals from the external world that both partners are able to increase their mutual “understanding.” Yet, even “understanding” is not directed to the proposed object either. This scheme transparently describes the inner structure of what Charles S. Peirce called a “sign situation.” Neither accuracy, nor precision or vagueness are relevant dimensions in such kinds of mutually induced “activities,” which we may call a Peircean “sign.” They are completely secondary, a symptom of the use and of the openness.
Russell correctly proposes that all words in a language are vague. Yet, we would like to extend his proposal by drawing on the image of thought that we explicate throughout all of our writings here. Elsewhere we already cited the Lagrangian trick in abstraction. Lagrange became aware of the power of a particular replacement operation: In a proposal or proposition, constants can always be replaced by appropriate procedures plus further constants. This increases the generality and abstractness of the representation. Our proposal, which extends Russell’s insight, is aligned to this scheme:
Different words are characterised (among other factors) by different procedures to select a particular class (mode) of interpretation.
Such procedures are given precisely as a kind of model, necessary in addition to those models implied in performing the interpretation of the actual phenomenon. The mode of interpretation comprises the selection of the scale employed in the operationalization, viz. measurement. Coarser scales imply a more profound underdetermination, a larger variety of possible and acceptable models, and a stronger feeling of vagueness.
Note that all these models are outside of language. In our opinion it does not make much sense to instantiate the model inside of language and then to claim a necessarily quite opaque “interpretation function,” as Burnett extensively demonstrates (if I understood her correctly). Our proposal is also more general (and more abstract) than Burnett’s, since we emphasize the procedural selection of interpretation models (note that models are not functions!). The necessary models for words like “taller,” “balder” or “cloudy” are not part of language and can’t be defined in terms of linguistic concepts. I would not call that a “cognitivist” stance, yet. We conceive it just as a consequence of the transcendental status of models. This proposal is linked to two further issues. First, it implies the acceptance of the necessity of models as a condition. In turn, we have to clarify our attitude towards the philosophical concept of the condition. Second, it implies the necessity of an instantiation, the actualization of it as the move from the transcendental to the applicable, which in turn invokes further transcendental concepts, as we will argue and describe here.
Saying this we could add that models are not confined to “epistemological” affairs. As the relation between language (as a practice) and the “generalized” model shows, there is more in it than a kind of “generalized epistemology.” The generalization of epistemology can’t be conceived as a kind of epistemology at all, as we will argue in the chapter about the choreosteme. The particular relation between language and model as we have outlined it should also make clear that “models” are not limited to the categorization of observables in the outer world. It also applies—now in more classic terms—to the roots of what we can know without observation (e.g. Strawson, p.112 in [12]). It is not possible to act, to think, or to know without implying models, because it is not possible to act, to think or to know without transformation. This gives rise to model as a category and to the question of the ultimate conditionability of language, actions, or knowing. In our opinion, and in contrast to Strawson’s distinction, it is not appropriate to separate “knowledge from observations” and “knowledge without observation.” Insisting on such a separation immediately would also drop the insight about mutual dependency of models, concepts, symbols and signs, among many other things. In short, we would fall back directly into the mystic variant of idealism (cf. Frege’s hyperplatonism), implying also some “direct” link between language and idea. We rate such a disrespect of the body, matter and mediating associativity as inappropriate and of little value.
It would be quite interesting to conduct a comparative investigation of the conceptual life cycle of pictorial information in contrast to textual information along the line opened by such a “processual indicative.” Our guess is that the textual “word” may have a quite interesting visual counterpart. But we have to work on this later and elsewhere.
Our extension also leads to the conclusion that “vague” is not a logical “opposite” of “accurate,” or of “precise” either. Here we differ (not only) from Bertrand Russell’s position. So to speak, the vagueness of language applies here too. In our perspective, “accurate” simply symbolizes the indicative to choose a particular class of models that a speaker suggests to the partner in discourse to use. Nothing more, but also nothing less. Models cannot be the “opposite” of other models. Words (or concepts) like “vague” or “accurate” just explicate the necessity of such a choice. Most of the words in a language refer only implicitly to that choice. Adjectives, whether absolute or relative, are bivalent with respect to the explicitness or implicitness of the choice of the procedure, just depending on the context.
For us it feels quite nice to discover a completely new property of words as they occur in natural languages. We call it “processual indicative.” A “word” without such a processual indicative on the structural level would not be a “word” any more. Either it reduces to a symbol, or even an index, or the context degenerates from a “natural” language (spoken and practiced in a community) into a formal language. The “processual indicative” of the language game “word” is a grammatical property (grammar here as philosophical grammar).
Nuisance, Flaws, and other Improprieties
Charles S. Peirce once mentioned, in a letter around 1908, that is, well after his major works, and in answer to a question about the position or status of his own work, that he tended to label it as idealistic materialism. Notably, Peirce founded what is known today as American pragmatism. The idealistic note, as well as the reference to materialism, have to be taken extremely abstractly in order to justify such a label. Of course, Peirce himself was able to handle such abstract levels.
Usually, however, idealism and pragmatism are in a strong contradiction to each other. This is especially true when it comes to engineering, or more generally, to the problematics of the deviation, or the problematics posed by the deviation, if you prefer.
Obviously, linguistics is blind to, or even self-deceptive about, its domain-specific “flaw,” vagueness. Linguists treat vagueness as a kind of flaw, or nuisance, or at least as a kind of annoyance that needs to be overcome. As we already mentioned, there are many incommensurable proposals for how to overcome it, except one: checking whether it is a flaw at all, and which conditions or assumptions lead to the proposal that vagueness is indeed a flaw.
Taking just one step back, it is quite obvious that logical positivism and its inheritance are the cause of the flaw. The problem “appeared” in the early 1960s, when positivism was prevailing. Dropping the assumptions of positivism also removes the annoyance of vagueness.
Engineering a new device is a demanding task. Yet, there are two fundamentally different approaches. The first one, more idealistic in character, starts with an analytic representation, that is, a formula, or more likely, a system of formulas. Any influence that is not covered by that formula is shifted either into the premises or into the so-called noise: influences about which nothing “could” be known, and which drive the system under modeling in an unpredictable direction. Since this approach starts with a formula, that is, an analytic representation, we can also say that it starts under the assumption of representability, or identity. In fact, whenever you find designers, engineers or politicians speaking about “disturbances,” it is more than obvious that they follow the idealistic approach, which in turn follows a philosophy of identity.
The second approach is very different from the first one, since it does not start with identity. Instead, it starts with the acknowledgement of difference. Pragmatic engineering does not work against nuisances, it works precisely within and along nuisances. Thus, there is no such thing as a nuisance, a flaw, an annoyance, etc. There is just fluctuation. Instead of assuming the structural constant labeled as “ignorance,” as represented by the concept of noise, there is a procedure that is able to digest any fluctuation. A “disturbance” is nothing that can be observed as such. Quite in contrast, it is just and only a consequence of a particular selection of a purpose. Thus, pragmatic engineering leads to completely different structures than would be generated under idealistic assumptions. The difference between both remains largely invisible in all cases where the information part is negligible (which actually is never the case), but it is vital to consider it in any context where formalization is dealing with information, whether it is linguistics or machine learning.
The issue relates to “cognition” too, understood here as the naively and phenomenologically observable precipitation of epistemic conditions. From everyday experience, but also as researchers in the “cognitive sciences,” we know, i.e. we could agree on the proposal, that cognition is something that is astonishingly stable. The traditional structuralist view, as Smith & Jones call it [13], takes this stability as a starting point and as the target of the theory. The natural consequence is that this theory rests on the a priori assumption of a strict identifiability of observable items and of the result of cognitive acts, which are usually called concepts and knowledge. In other words, the idea that knowledge is about identifiable items is nothing else than a petitio principii: since it serves as the underlying assumption, it is no surprise that the result in the end exhibits the same quality. Yet, there is a (not so) little problem, as Smith & Jones correctly identified (pp. 184–185):
The structural approach pays less attention to variability (indeed, under a traditional approach, we design experiments to minimize variability) and not surprisingly, it does a poor job explaining the variability and context sensitivity of individual cognitive acts. This is a crucial flaw. […]
Herein lies our discontent: If structures control what is constant about cognition, but if individual cognitive acts are smartly unique and adaptive to the context, structures cannot be the cause of the adaptiveness of individual cognitions. Why, then, are structures so theoretically important? If the intelligence – and the cause of real-time individual cognitive acts – is outside the constant structures, what is the value of postulating such structures?
The consequence the authors draw is to conceive cognition as process. They cite the work of Freeman [14] about the cognition of smelling:
They found that different inhalants did not map to any single neuron or even group of neurons but rather to the spatial pattern of the amplitude of waves across the entire olfactory bulb.
The heritage of naive phenomenology (phenomenology is always naive), with its main pillar of the “identifiability of X as X,” obviously leads to conclusions that are disastrous for the traditional theory: it vanishes.
Given these difficulties, positivists are trying to adapt. Yet, people still dream of semantic disambiguation as a mechanical technique, or likewise dream (as Fregean worshipers) of eradicating vagueness from language by trying to explain it away.
One of the paradoxes dealt with over and over again is the already mentioned Sorites (Greek for “heap”) paradox. When is a heap a heap? Closely related to it are constructions like Wang’s Paradox [15]: If n is small, then n+1 is also small. Hence there is no number that is not small. How to deal with that?
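Spelled out as an induction, the structure of Wang’s Paradox may be sketched as follows. The predicate symbol S (“is small”) and the explicit base case “0 is small” are our notational assumptions, added only to make the inference step visible; this is a sketch, not a rendering of Dummett’s own formalism.

```latex
\begin{align*}
&\text{(base, assumed)} && S(0) \\
&\text{(tolerance)}     && \forall n\,\bigl(S(n) \rightarrow S(n+1)\bigr) \\
&\text{(conclusion)}    && \forall n\, S(n)
\end{align*}
```

The inference is formally valid by mathematical induction; what is in dispute is the tolerance premise, i.e. whether a vague predicate like “small” can be treated as an identifiable, sharply bounded property of numbers at all.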
Certainly, it does not help to invoke the famous “context dependency” as a potential cure. Jaegher and Rooij recently wrote [16]:
“If, as suggested by the Sorites paradox, fine-grainedness is important, then a vague language should not be used. Once vague language is used in an appropriate context, standard axioms of rational behaviour are no longer violated.”
Yet, what could “appropriate” mean? Actually, for an endeavor such as the one Jaegher and Rooij have started, appropriateness needs to be determined by some means that cannot be affected by vagueness. But how to do that for language items? They continue:
“The rationale for vagueness here is that vague predicates allow players to express their valuations, without necessarily uttering the context, so that the advantage of vague predicates is that they can be expressed across contexts.”
At first sight, this seems plausible. Now, any part of language can be used in any context, so all the language is vague. The unfavorable consequence for Jaegher & Rooij is that their attempt is not even a self-disorganizing argument; it has the unique power of being self-vanishing: their endeavor of expelling vagueness is doomed to fail before they have even started. Their main failure is, however, that they take for granted the a priori assumption that vagueness and crispness are “real” entities that somehow exist before any perception, such that language could be “infected” or affected by them. Note that this is not a statement about linguistics; it is one about philosophical grammar.
It also does not help to insist on “tolerance.” Rooij [17] recently mentioned that “vagueness is crucially related with tolerant interpretation.” Rooij desperately tries to hide his problem: the expression “tolerant interpretation” is almost completely empty. What should it mean to interpret something tolerantly as X? Not as X? Also a bit as Y? How then would we exchange ideas, and how could it be that we agree exactly on something? The problem is just moved around a corner, but not addressed in any reasonable manner. Yet, there is a second objection to “tolerant interpretation.”
Interpretation of vague terms by a single entity must always fail. What is needed are TWO interpretations that are played as negotiation in language games. Two entities, whether humans or machines, have to agree, i.e. they also have to be able to perform the act of agreeing, in order to resolve the vagueness of items in language. It is better to drop vagueness altogether and simply to say that at least two entities must necessarily be “present” to play a language game. This “presence” is, of course, an abstract semiotic one. It is given in any Peircean sign situation. Since signs refer only and always just to other signs, vagueness is, in other words, not a difficulty that needs to be “tolerated.”
Dummett [15] spent more than 20 pages on the examination of the problem of vagueness. To date it is one of the most thorough examinations, but unfortunately it has not been received or recognized by contemporary linguistics. There is still a debate about it, but no further development of it. Dummett essentially proves that vagueness is not a defect of language; it is a “design feature.” First, he proposes a new logical operator “definitely” in order to deal with the particular quality of indeterminateness of language. Yet, it does not remove vagueness or its problematic, “that is, the boundaries between which acceptable sharpenings of a statement or a predicate range are themselves indefinite.” (p.311)
He concludes that “vague predicates are indispensable”: they are not eliminable in principle without losing language itself. Tolerance helps as little as selecting “appropriate contexts” does, both having been proposed to get rid of a problem. What linguists propose (at least those adhering to positivism, i.e. nowadays nearly all of them) is to “carry around a colour-chart, as Wittgenstein suggested in one of his examples” (Dummett). This would turn observational terms into legitimated ones by definition. Of course, the “problem” of vagueness would vanish, but along with it also any possibility to speak and to live. (Any apparent similarity to real persons, politicians or organizations such as the E.U. is indeed intended.)
Linguistics, and the cognitive sciences as well, will fail to provide any valuable contribution as long as they apply the basic condition of the positivist attitude: that subjects could be separated from each other in order to understand the whole. The whole here is the Lebensform working underneath, or beyond (Foucault’s field of proposals, Deleuze’s sediments), connected cognitions. It is almost ridiculous to try to explain anything regarding language within the assumption of identifiability and applicability of logics.
Smith and Jones close their valuable contribution with the following statement, abandoning the naive realism-idealism that has been exhibited so eloquently by Rooij and his co-workers nearly 20 years later:
On a second level, we questioned the theoretical framework – the founding assumptions – that underlie the attempt to define what “concepts really are.” We believe that the data on developing novel word interpretations – data showing the creative intelligence of dynamic cognition – seriously challenge the view of cognition as represented knowledge structures. These results suggest that perception always matters in a deep way: Perception always matters because cognition is always adaptive to the here-and-now, and perception is our only means of contact with the here-and-now reality.
There are a number of interesting corollaries here, which we will not follow here. For instance, it would be a categorical mistake to talk about noise in complex systems. Another consequence is that engineering, linguistics or philosophy that is based on the apriori concept of identity is not able to make reasonable proposals about evolving and developing systems, quite in contrast to a philosophy that starts with difference (as a transcendental category, see Deleuze’s work, particularly [18]).
We can now understand that idealistic engineering imposes its judgments way too early. Consequently, idealistic engineering commits the naturalistic fallacy in the same way as much of linguistics commits it, at least as far as the latter starts with the positivistic assumption of the possibility of positive assumptions such as identifiability. The conclusion for the engineering of machine-based episteme is quite obvious: we cannot start with identified or even identifiable items, and where it seems that we meet them, as in the case of words, we have to take their identifiability as a delusion or illusion. We could also say that the only feasible input for a machine that is supposed to “learn” is made from vague items for which there is only a probabilistic description. Even more radically, we can see that without fundamentally embracing vagueness no learning is possible at all. That is the real reason for the failure of “strong” or “symbolic” AI.
Conclusions for Machine-based Epistemology
We started with a close inspection and a critique of the concept of vagueness and ended up in a contribution to the theory of language. Once again we see that language is not just about words, symbols and grammar. There is much more in it and about it that we must understand to bring language into contact with (brain) matter.
Our results clearly indicate, against the mainstream in linguistics and large parts of (mainly analytic) philosophy, that words can’t be conceived as parts of predicates, i.e. clear proposals, and language can’t be used as a vehicle for the latter. This again justifies an initial probabilistic representation of those grouped graphemes (phonemes) as they can be taken from a text, and which we call “words.” Of course, the transition from a probabilistic representation to the illusion of propositions is not a trivial one. Yet, it is not words that we can see in the text, it is just graphemes. We will investigate the role and nature of words at some later point in time (“Waves, Words, and Images”, forthcoming).
Secondly, we discovered a novel property or constituent of words, which is a selection function (or a class thereof) that indicates the style of interpretation regarding the implied style of presumed measurement. We called it the processual indicative. Such a selection results in the invoking of clear-cut relations or boundaries, or of indeterminable ones. Any implementation of the understanding of language necessarily has to implement such a property for all of the words. In all of the approaches known so far, this function is nonexistent, leading to serious paradoxes and disabilities.
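To make the idea of a selection function a bit more tangible, here is a deliberately minimal sketch in Python. Everything in it is an assumption made for illustration only: the names `Word`, `MODEL_CLASSES` and `interpret`, the two model classes “crisp” and “graded,” and the numeric thresholds are hypothetical and do not describe an existing implementation. The only point it tries to show is that the word carries a symbol plus an indicative, while the models themselves live outside the word.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A "model" here is any procedure mapping an observation to a degree in [0, 1].
# (Hypothetical simplification; the text does not restrict models to this form.)
Model = Callable[[float], float]


def crisp_model(threshold: float) -> Model:
    """Clear-cut boundary: the degree is either 0.0 or 1.0."""
    return lambda x: 1.0 if x >= threshold else 0.0


def graded_model(low: float, high: float) -> Model:
    """Indeterminate boundary: the degree rises linearly between low and high."""
    def model(x: float) -> float:
        if x <= low:
            return 0.0
        if x >= high:
            return 1.0
        return (x - low) / (high - low)
    return model


@dataclass(frozen=True)
class Word:
    symbol: str      # the visible part (graphemes)
    indicative: str  # hypothetical label naming the class of models to select


# Models are kept outside the words; a word only points at a class of them.
# The thresholds (body height in cm) are arbitrary illustrative values.
MODEL_CLASSES: Dict[str, Model] = {
    "crisp": crisp_model(threshold=180.0),
    "graded": graded_model(low=170.0, high=190.0),
}


def interpret(word: Word, observation: float) -> float:
    """Select the model class indicated by the word, then apply it."""
    model = MODEL_CLASSES[word.indicative]
    return model(observation)


tall_vague = Word("tall", "graded")
tall_exact = Word("tall", "crisp")
```

The same symbol “tall” yields a graded or a clear-cut judgment depending only on the indicative it carries. A less naive version would let two parties negotiate which model class applies, rather than reading it from a fixed table.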
A quite nice corollary of these results is that words never could be taken as a reference. It is perhaps more appropriate to conceive of words as symbols for procedural packages, recipes and prescriptions on how to arrange certain groups of models. Taken as such, van Fraassen’s question of how words acquire reference is itself based on a drastic misunderstanding, deeply informed by positivism (remember that it was van Fraassen who invented this weird thing called supervaluationism). There is no such “reference.” Instead, we propose to conceive of words as units consisting of (visible) symbols and a “Lagrangean” differential part. This new conception of words remains completely compatible with Wittgenstein’s view on language as a communal practice; yet, it avoids some difficulties Wittgenstein struggled with throughout his life. The core of it may be found in PI §201, describing the paradox of rule following. For us, this paradox simply vanishes. Our model of words as symbolic carriers of “processual indicatives” also sheds light on what Charles S. Peirce called a “sign situation,” having not been able to elucidate the structure of “signs” any further. Our inferentialist scheme lucidly describes the role of the symbolic as a quasi-material anchor, from which we can proceed via models as targets of the “processual indicative” to the meaning as a mutually ascribed resonance.
The introduction of the “processual indicative” also allows us to understand the phenomenon that, despite the vagueness of words and concepts, it is possible to achieve very precise descriptions. The precision, however, is just a “feeling,” as is the case for “vagueness,” dependent on a particular discursive situation. Larger amounts of “social” rules that can be invoked to satisfy the “processual indicative” allow for more precise statements. If, however, these rules are indeterminate by themselves, quite often more or less funny situations may occur (or disastrous misunderstandings as well).
The main conclusion, finally, refers to the social aspect of a discourse. It is largely unknown how two “epistemic machines” will perceive, conceive of and act upon each other. Early experiments by Luc Steels involved mini robots that were far too primitive to draw any valuable conclusion for our endeavor. And Stanisław Lem’s short story “Personetics” [19] does not contain any hint about implementational issues… Thus, we first have to implement it…
Notes
1. One of Cantor’s paradoxes claims that a 2-dimensional space can be mapped entirely onto a 1-dimensional space without projection errors or overlaps. All of Cantor’s work is “absurd,” since it mixes two games that a priori have been separated: countability and non-countability. The dimensions paradox appears because Cantor conceives of real numbers as determinable, hence countable entities. However, by his own definition via the Cantor triangle, real numbers are uncountably infinite. Real numbers are not determinable, hence they can’t be “reordered,” or put along a 1-dimensional line. It’s a “silly” contradiction. We conclude that such paradoxes are pseudo-paradoxes.
2. The Banach-Tarski (BT) pseudo-paradox is of the same structure as the dimensional pseudo-paradox of Cantor. The surface of a sphere is broken apart into a finite number of “individual” pieces; yet, those pieces are not of determinate shape. Then BT prove that from the pieces of 1 sphere 2 spheres can be created. No surprise at all: the pieces are not of determinate shape, they are complicated: they are not usual solids but infinite scatterings of points. It is “silly” first to speak about pieces of a sphere, but then to dissolve those pieces into Cantor dust. Countability and uncountability collide. Thus there is no coherence, so they can be any. The BT paradox is even wrong: from such premises an infinite number of balls could be created from a single ball, not just a second one.
3. Dedekind derives natural numbers as actualizations from their abstract uncountable differentials, the real numbers.
4. Taylor’s paradox brings scales into conflict. A switch is toggled repeatedly after a decreasing period of time, such that the next period is just half of the size of the current one. After n toggling events (n>>), what is the state of the switch? Mathematically, it is not defined (1 AND 0), statistically it is 1/2. Again, countability, which implies a physical act, ultimately limited by the speed of light, is contrasted with infinitely small quantities, i.e. uncountability. According to Gödel’s incompleteness, for any formal system it is possible to construct paradoxes by putting up “silly” games, which do not obey the self-imposed a priori assumptions.
This article was created on Dec 29th, 2011, and was republished in a considerably revised form on March 23rd, 2012.
References
[1] Heather Burnett, The Puzzle(s) of Absolute Adjectives – On Vagueness, Comparison, and the Origin of Scale Structure. Denis Paperno (ed.), “UCLA Working Papers in Semantics,” 2011; version referred to is from 20.12.2011. Available online.
[2] Brian Weatherson (2009), The Problem of the Many. Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. Available online, last access 28.12.2011.
[3] Peter Unger (1980), The Problem of the Many.
[4] Bertrand Russell (1923), Vagueness. Australasian Journal of Psychology and Philosophy, 1(2), 84–92.
[5] Robert Brandom, Making it Explicit. 1994.
[6] John Austin, Speech Act Theory.
[7] Colin Johnston (2009), Tractarian objects and logical categories. Synthese 167: 145–161.
[8] Cantor
[9] Banach
[10] Dedekind
[11] Taylor
[12] Peter Strawson, Individuals: An Essay in Descriptive Metaphysics. Methuen, London 1959.
[13] Linda B. Smith, Susan S. Jones (1993), Cognition Without Concepts. Cognitive Development, 8, 181–188. Available online.
[14] W.J. Freeman (1991), The physiology of perception. Scientific American, 264, 78–85.
[15] Michael Dummett (1975), Wang’s Paradox. Synthese 30: 301–324. Available online.
[16] Kris De Jaegher, Robert van Rooij (2011), Strategic Vagueness, and appropriate contexts. Language, Games, and Evolution, Lecture Notes in Computer Science, 2011, Volume 6207/2011, 40–59, DOI: 10.1007/978-3-642-18006-4_3
[17] Robert van Rooij (2011), Vagueness, tolerance and non-transitive entailment. In: Understanding Vagueness – Logical, Philosophical and Linguistic Perspectives, Petr Cintula, Christian Fermüller, Lluis Godo, Petr Hajek (eds.), College Publications, 2011.
[18] Gilles Deleuze, Difference and Repetition.
[19] Stanisław Lem, Personetics. Reprinted in: Douglas Hofstadter, The Mind’s I.