February 28, 2012
There are good reasons to think that data appear
as the result of friendly encounters with the world.
Originally, “data” was conceived as the “given”, or as things that are given, if we follow the etymological traces. That is not very surprising, since it is closely related to the concept of date as a point in time. And what, if not time, could be something that is given? The concept of date is, on the other hand, related to computation, at least if we consider etymology again. Towards the end of the Middle Ages, the problems around the calculation of the next Easter date(s) triggered the first institutionalized recordings of rule-based approaches that have been called “computation.” Even in those times, it was already a subject for specialists…
Yet, the cloud of issues around data also involves things. But “things” are not invariably given, so to speak as a part of an independent nature. In Nordic languages there is a highly interesting link to constructivism. Things originally denoted some early kind of parliament. The Icelandic “alþingi”, or transposed “Althingi”, is the oldest parliamentary institution in the world still extant, founded in 930. If we take this thread further it is clear that things refer to entities that have been recognized by the community as subject to standardization. That’s the job of parliaments or councils. Said standardization comprises the name, rules for recognizing it, and rules for using or applying it, or simply, how to refer to it, e.g. as part of a semiosic process. That is, some kind of legislation, or norming, if not to say normalization. (That’s not a bad thing in itself, only if a society is too eager in doing so; standardization is a highly relevant condition for developing higher complexity, see here.) And, back to the date, we fortunately also know about a closely related usage of “date” as in “dating” or “to make a date”, in other words, to fix the (mostly friendly) issues with another person…
The wisdom of language, as Michel Serres once coined it (somewhere in his Hermes series, I suppose), knew everything, it seems. Things simply are not, because they remain completely beyond any possibility of perceiving them if there is no standard for treating the differential signals they provide. This “treatment” we usually call interpretation.
What we can observe here in the etymological career of “data” is nothing else than a certain relativization, a de-centering of the concept away from the absolute centers of nature, or likewise the divine. We observe nothing else than the evolution of a language game into its reflected use.
This now is just another way to abolish ontology and its existential attitude, at least as far as it claims an “independent” existence. In order to become clear about the concept of data, what we can do about it, or even how to use data, we have to arrive at a proper level of abstraction, which is not difficult to understand in itself.
This, however, also means that “data processing” can’t be conceived in the way we conceive, for instance, the milling of grain. Data processing should be taken much more as a “data thinging” than as a data milling, or data mining. There is deep relativity in the concept of data, because it is always an interpretation that creates them. It is nonsense to naturalize them in the infamous equation “information=data+meaning”; we already discussed that in the chapter about information. Yet, this process probably has not reached its full completion, especially not in the discipline of so-called computer “sciences”. Well, every science started as some kind of Hermetism or craftsmanship…
Yet, one still might say that at a given point in time we come upon encoded information, we encounter some written, stored, or otherwise materially represented structured differences. Well, ok, that’s true. However, and that’s a big however: We still can NOT claim that the data is something given.
This raises a question: what are we actually doing when we say that we “process” data? At first sight, and many people think so, processing data produces information. But again, it is not a processing in the sense of milling. This information thing is not the result of some kind of milling. It needs constructive activities and calls for affected involvement.
Obviously, the result or the product of processing data is more data. Data processing is thus a transformation. Probably it is appropriate to say that “data” is the language game for “transforming the possibility for interpretation into its manifold.” Nobody should wonder about the fact that there are more and more “computers” all the time and everywhere. Besides the fact that the “informationalization” of any context allows for an improved generality as well as for improved accuracy (they excluded each other in the mechanical age), the conceptual role of data itself produces a built-in acceleration.
Let us leave the trivial aspects of digital technology behind, that is, everything that concerns mere re-arrangement and recombination without losing or adding anything. Of course, creating a pivot table may lead to new insights, since we suddenly (and simply) can relate things that we couldn’t relate without pivoting. Nevertheless, it is mere re-arrangement, helpful as it is. It is clear that pivoting itself does not produce any insight.
Our interest is in machine-based episteme and its possibility. So, the natural question is: How to organize data and its treatment such that machine-based episteme is possible? Obviously this treatment has to be organized and developed in a completely autonomous manner.
In so-called data mining, which can only be considered a somewhat childish misnomer, people often report that they spend most of the time preparing data. Up to 80% of the total project time budget is spent on “preparing data”. Nothing else could render the inappropriate concepts behind data mining more visible than this fact. But one step at a time…
The input data to machine learning are often considered to be extremely diverse. First, we have to distinguish between structured data and unstructured data such as text or images; second, we have to distinguish between the different scales of expression.
Table 1: Data in the Quality Domain
|structured data||things like tables, or schemes, or data that could be brought into that form in one way or another; often related to physical measurement devices or organizational issues (or habits)|
|unstructured data||entities that, in principle, can’t be brought into a structured form before processing them. It is impossible to extract the formal “properties” of text before interpreting it; those properties we would have to know before being able to set up any kind of table into which we could store our “measurement”. Hence, unstructured data can’t be “measured”. Everything is created and constructed “on-the-fly”, sailing while building the raft, as Deleuze (Foucault?) put it once. Any input needs to be conceived as and presented to the learning entity in a probabilized form.|
Table 2: Data in the Scale Domain
|real-valued scale||numeric, like 1.232; mathematically: real numbers, (ir)rational numbers, etc.; infinitely many different values|
|ordinal scale||enumerations, ordering, limited to a rather small set of values, typically n<20, such as 1,2,3,4; mathematically: natural numbers, integers|
|nominal scale||singular textual tokens, such as “a”, “abc”, “word”|
|binary scale||only two values are used for encoding, such as 1,0, or yes,no etc.|
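The four scales can be told apart mechanically in many cases. The following is only a minimal sketch under stated assumptions: the function name `infer_scale`, the rule "at most two distinct values means binary", and the use of the table's n<20 as a hard threshold are illustrative choices, not something the text prescribes.

```python
from typing import Sequence, Union

Value = Union[int, float, str, bool]


def infer_scale(values: Sequence[Value], max_ordinal_levels: int = 20) -> str:
    """Guess the scale of a column (hypothetical helper, heuristic only)."""
    distinct = set(values)
    # Two (or fewer) distinct values: treat the column as binary, e.g. 1/0 or yes/no.
    if len(distinct) <= 2:
        return "binary"
    # Any textual token makes the column nominal.
    if any(isinstance(v, str) for v in distinct):
        return "nominal"
    # A small set of whole numbers is read as an ordinal enumeration.
    all_integral = all(float(v).is_integer() for v in distinct)
    if all_integral and len(distinct) < max_ordinal_levels:
        return "ordinal"
    # Everything else is treated as real-valued (assumption: this includes
    # integer columns with many distinct values).
    return "real-valued"
```

Such a guess is only a starting point; a column of integer codes, for instance, may in fact be nominal, which this heuristic cannot know.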
Often it is proposed to regard the real-valued scale as the most dense one, hence it is the scale that could be expected to transport the largest amount of information. Despite the fact that this is not always true, it surely allows for a superior way to describe the risk in modeling.
That’s not all, of course. Consider for instance domains like the financial industry. Here, all the data are marked by a highly relevant point of anisotropy regarding the scale: the zero. As soon as something becomes negative, it belongs to a different category, although it could be quite close to another value if we consider just the numeric value. It is such domain-specific issues that contribute to the large efforts people spend on the preparation of data. It is clear that any domain is structured by and knows about a lot of such “singular” points. People then claim that they have to be specialists in the respective domain in order to be able to prepare the data.
Yet, that’s definitely not true, as we will see.
In order to understand the important point we have to understand a further feature of data in the context of empirical analysis. Remember that in empirical analysis we are looking primarily for a mapping function, which transforms values from measurement into values of a prediction or diagnosis, in short, into the values that describe the outcome. In medicine we may measure physiological data in order to achieve a diagnosis, and doing so is almost identical to what other people do when they perform measurements in an organization.
Measured data can be described by means of a distribution. A distribution simply describes the relative frequency of certain values. Let us resort to the following two examples. Here you simply see frequency histograms, where each bin reflects the relative frequency of the values falling into the respective bin.
What is immediately striking is that both are far from the analytical distributions like the normal distribution. They are both strongly rugged, far from being smooth. What we can also see: they have more than one peak, even though it is not clear how many peaks there are.
Actually, in data analysis one meets such conditions quite often.
Figure 1a. A frequency distribution showing (at least) two modes.
Figure 1b. A sparsely filled frequency distribution
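The kind of histogram described above, together with the two properties that make it "rugged" (several peaks, empty bins), can be sketched in a few lines. This is an illustration only: the function names, the equal-width binning and the crude definition of a peak as a bin higher than both neighbours are assumptions, not the procedure behind the figures.

```python
from typing import List


def relative_histogram(values: List[float], n_bins: int = 10) -> List[float]:
    """Relative frequency of the values per equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything lands in the first bin
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just outside, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def local_peaks(hist: List[float]) -> List[int]:
    """Indices of bins strictly higher than both neighbours (crude mode candidates)."""
    padded = [0.0] + hist + [0.0]
    return [i - 1 for i in range(1, len(padded) - 1)
            if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]]


def empty_bins(hist: List[float]) -> List[int]:
    """Indices of bins that received no value at all."""
    return [i for i, h in enumerate(hist) if h == 0.0]
```

On real, noisy data the simple peak test reports many spurious peaks, which mirrors the remark that it is often unclear how many peaks there actually are.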
So, what to do with that?
First, the obvious anisotropy renders any trivial transformation meaningless. Instead, we have to focus precisely on those inhomogeneities. In a process perspective we may reason that the data that have been measured by a single variable actually stem from at least two different processes, or that the process is non-stationary and switches between (at least two) different regimes. In either case, we split the variable into two, applying a criterion that is intrinsic to the data. This transformation is called deciling, and it is probably the third-most important transformation that could be applied to data.
Well, let us apply deciling to the data shown in Figure 1a.
Figure 2a,b: Distributions after deciling a variable V0 (as of Figure 1a) into V1 and V2. The improved resolution for the left part is not shown.
The result is three variables, and each of them “expresses” some features. Since we can treat them (and the values comprised) independently, we obviously constructed something. Yet, we did not construct a concept, we just introduced additional potential information. At that stage, we do not know whether this deciling will help to build a better model.
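The splitting of V0 into V1 and V2 can be sketched as follows. The text does not say which intrinsic criterion is used, so this sketch assumes one: the split point is the centre of the least populated bin between the two most populated, non-adjacent bins. The names `split_threshold` and `split_variable` are hypothetical, and values outside a part are encoded as missing (`None`).

```python
from typing import List, Optional, Tuple


def split_threshold(values: List[float], n_bins: int = 10) -> float:
    """Find a split point in the valley between the two dominant bins.

    Assumes max(values) > min(values) and n_bins >= 3.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    # Most populated bin, then the most populated bin that is not its neighbour.
    first = max(range(n_bins), key=lambda i: counts[i])
    second = max((i for i in range(n_bins) if abs(i - first) > 1),
                 key=lambda i: counts[i])
    left, right = sorted((first, second))
    # Least populated bin strictly between the two dominant bins.
    valley = min(range(left + 1, right), key=lambda i: counts[i])
    return lo + (valley + 0.5) * width


def split_variable(values: List[float], threshold: float
                   ) -> Tuple[List[Optional[float]], List[Optional[float]]]:
    """Split one variable into two; values outside a part become missing."""
    v1 = [v if v < threshold else None for v in values]
    v2 = [v if v >= threshold else None for v in values]
    return v1, v2
```

Together with the original V0, the two parts give the three variables mentioned above, each of which can then be transformed independently.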
Variable V1 (Figure 2a, left part) can be transformed further, by shifting the values to the right through applying a log-transformation. A log-transformation increases the differences between small values and decreases the differences between large values, and it does so in a continuous fashion. As a result, the peak of the distribution will move more to the right (and it will also be less prominent). Imagine a large collection of bank accounts, most of them filled with amounts between 1’000 and 20’000, while some host 10’00’000. If we map all those values onto the same width, the small amounts can’t be well distinguished any more, and we have to do that mapping, called linear normalization, with all our variables in order to make variances comparable. It is mandatory to transform such left-skewed distributions (those whose mass piles up on the left; in statistical terminology they are right-skewed) into a new variable in order to access the potential information represented by them. Yet, as always in data analysis, until we have completed the whole modeling cycle down to validation we cannot know whether a particular transformation will have any effect, let alone a positive one, on the power of our model.
The log transformation has a further quite neat feature: it is defined only for positive values. Thus, if we apply a transformation that creates negative values for some of the observed values and subsequently apply a log-transform, we create missing values. In other words, we disregard some parts of the information that originally was available in the data. So, a log-transform can be used to
- render items discernible in left-skewed distributions (mass piled up on the left), and to
- blend out parts of the information deliberately by a numeric transformation.
These two possible achievements make the log-transform one of the most frequently applied.
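Both uses of the log-transform can be illustrated with a small sketch. The account balances are made-up numbers, the function names are hypothetical, and encoding missing values as `None` is an assumption of this example.

```python
import math
from typing import List, Optional

MaybeFloat = Optional[float]


def linear_normalize(values: List[MaybeFloat]) -> List[MaybeFloat]:
    """Map the present values linearly onto [0, 1]; missing values stay missing."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    span = hi - lo
    return [None if v is None else ((v - lo) / span if span else 0.0)
            for v in values]


def log_transform(values: List[MaybeFloat]) -> List[MaybeFloat]:
    """Natural log; non-positive (and already missing) values become missing."""
    return [math.log(v) if v is not None and v > 0 else None for v in values]


# Illustrative balances: most are small, one is very large.
balances = [1_000.0, 5_000.0, 20_000.0, 1_000_000.0]
plain = linear_normalize(balances)                 # small amounts crowd near 0
logged = linear_normalize(log_transform(balances)) # small amounts spread out
```

In `plain` the three small balances are squeezed into the lowest two percent of the interval, while in `logged` they are clearly separated. Applying `log_transform` to values that were shifted below zero beforehand turns exactly those values into missing ones.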
The most important transformation in predictive modeling is the construction of new variables by combining a small number (typically 2) of hitherto available ones, either analytically by some arithmetic or, more generally, by any suitable mapping, including the SOM, from n variables to 1 variable. Yet, this will be discussed at a later point (in another chapter, for an overview see here). The trick is to find the most promising of such combinations of variables, because obviously the number of possible combinations is almost infinitely large.
Anyway, the transformed data will be subject to an associative mechanism, such as the SOM. Such mechanisms are based on the calculation of similarities and the comparison of similarity values. That is, the associative mechanism does not consider any of the tricky transformations, it just reflects the differences in the profiles (see here for a discussion of that).
Up to this point the conclusion is quite clear. Any kind of data preparation just has to improve the distinguishability of individual bits. Since we do not know anything anyway about the structure of the relationship between the measurement, the prediction and the outcome we try to predict, there is nothing else we could do in advance. Secondly, this means that there is no need to import any kind of semantics. Now remember that transforming data is an analytic activity, while it is the association of things that is a constructive activity.
There is a funny effect of this principle of discernibility. Imagine an initial model that comprises two variables v-a and v-b, among some others, for which we have found that the combination a*b provides a better model. In other words, the associative mechanism found a better representation for the mapping of the measurement to the outcome variable. Now first remember that all values for any kind of associative mechanism have to be scaled to the interval [0..1]. Multiplying two sets of such values introduces a salient change if both values are small or if both values are large. So far, so good. The funny thing is that the same degree of discernibility can be achieved by the transformative coupling v-a/v-b, that is, by division. The change is orthogonal to that introduced by the multiplication, but that is not relevant for the comparison of profiles. This simple effect nicely explains a “psychological” phenomenon… actually, it is not a psychological but rather an empirical one: One can invert the proposal about a relationship between any two variables without affecting the quality of the prediction. Obviously, it is not so much the transformative function as such that we have to consider as important. Quite likely, it is the form aspect of the data space warping qua transformation that we should focus on.
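To make the combinatorial problem concrete, here is a minimal sketch that builds candidate variables from all pairs of existing ones. It is restricted to two arithmetic couplings (product and ratio) chosen for illustration; the function names are hypothetical, and the more general mappings mentioned above, such as a SOM from n variables to 1, are not shown.

```python
import math
from itertools import combinations
from typing import Dict, List, Optional

Column = List[float]


def pairwise_candidates(columns: Dict[str, Column]
                        ) -> Dict[str, List[Optional[float]]]:
    """Derive product and ratio variables for every pair of columns."""
    derived: Dict[str, List[Optional[float]]] = {}
    for a, b in combinations(sorted(columns), 2):
        xs, ys = columns[a], columns[b]
        derived[f"{a}*{b}"] = [x * y for x, y in zip(xs, ys)]
        # Division by zero is encoded as a missing value (assumption).
        derived[f"{a}/{b}"] = [x / y if y != 0 else None
                               for x, y in zip(xs, ys)]
    return derived


def count_pairs(n_variables: int) -> int:
    """Number of unordered pairs among n variables."""
    return math.comb(n_variables, 2)
```

Already with 100 variables there are 4950 pairs, and every additional coupling function multiplies that number, which is why selecting promising combinations matters more than enumerating them.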
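What "comparison of similarity values" amounts to can be sketched with the step that assigns a profile to its most similar node. The text does not specify the similarity measure; Euclidean distance is assumed here, and the function names are hypothetical. This is only the matching step, not a complete SOM.

```python
import math
from typing import List, Sequence


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Distance between two profiles of equal length (smaller = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def best_matching_node(profile: Sequence[float],
                       nodes: List[Sequence[float]]) -> int:
    """Index of the node whose profile is closest to the given profile."""
    return min(range(len(nodes)),
               key=lambda i: euclidean_distance(profile, nodes[i]))
```

The function only sees the already transformed and scaled values; how they were produced plays no role in the comparison.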
All of those transformation efforts exhibit two interesting phenomena. First, we apply them all as a hypothesis, which describes the relation between data, the (more or less) analytic transformation, the associative mechanism, and the power of the model. If we can improve the power of the model by selecting just the suitable transformations, we also know which transformations are responsible for that improvement. In other words, we carried out a data experiment that, and that’s the second point to make here, revealed a structural hypothesis about the system we have measured. Structural hypotheses, however, could qualify as precursors of concepts and ideas. This switching back and forth between the space of hypotheses H and the space of models (or the learning map L, as Poggio et al. [1] call it)
Thus we end up with the insight that any kind of data preparation can be fully automated, which is quite contrary to the mainstream. For the mere possibility of machine-based episteme it is nevertheless mandatory. Fortunately, it is also achievable.
One (or two) last words on transformations. A transformation is nothing else than a method, and importantly, vice versa. This means that any method is just a potential transformation. Secondly, transformations are by far, and I mean really by far, more important than the choice of the associative method. There is almost no (!) literature about transformations, and almost all publications are about the proclaimed features of a “new” method. Such method hell is dispensable. The chosen method just needs to be sufficiently robust, i.e. it should not (preferably: never) introduce a method-specific bias or, alternatively, it should allow us to control as many of its internal parameters as possible. Thus we chose the SOM. It is the most transparent and general method to associate data into groups for establishing the transition from extensions to intensions.
Besides the choice of the final model, the construction of a suitable set of transformations is certainly one of the main jobs in modeling.
Automating the Preparation of Data
How to automate the preparation of data? Fortunately, this question is relatively easy to answer: by machine-learning.
What we need is just a suitable representation of the problem. In other words, we have to construct some properties that together potentially describe the properties of the data, especially the frequency distribution.
We have had good experience with applying curve fitting to the distribution in order to create the fingerprint that describes the properties of the values represented by a variable. For instance, a 5th-order polynomial, together with a negative exponential and a harmonic fit (trigonometric functions), are essential for such a fingerprint (don’t forget the first derivatives, and the deviation from the models). Further properties are the count and location of empty bins. The resulting vector typically comprises some 30 variables and thus contains enough information for learning the appropriate transformation.
We have seen that the preparation of data can be automated. Only very few domain-specific rules need to be defined a priori, such as the anisotropy around zero for the financial domain. Yet, the important issue is that they indeed can be defined a priori, outside the modeling process, and fortunately, they are usually quite well-known.
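A reduced version of such a fingerprint can be sketched as follows. It is a partial illustration under stated assumptions: only the polynomial part is implemented (the negative exponential and harmonic fits are omitted), the least-squares fit is done with plain normal equations, the bin centres are mapped to [-1, 1], and all function names as well as the layout of the resulting vector are hypothetical.

```python
from typing import List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def polyfit(xs: List[float], ys: List[float], degree: int) -> List[float]:
    """Least-squares polynomial coefficients, lowest order first."""
    size = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)]
         for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear(a, b)


def fingerprint(hist: List[float], degree: int = 5) -> List[float]:
    """Describe a relative-frequency histogram by a small feature vector."""
    n = len(hist)
    if n <= degree:
        raise ValueError("need more bins than the polynomial degree")
    xs = [2.0 * (i + 0.5) / n - 1.0 for i in range(n)]  # bin centres in [-1, 1]
    coeffs = polyfit(xs, hist, degree)
    # Coefficients of the first derivative of the fitted polynomial.
    derivative = [k * c for k, c in enumerate(coeffs)][1:]
    # Deviation of the data from the fitted model (root mean square).
    fitted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    rms = (sum((f - h) ** 2 for f, h in zip(fitted, hist)) / n) ** 0.5
    # Count and mean relative location of empty bins (-1.0 if there are none).
    empty = [i for i, h in enumerate(hist) if h == 0.0]
    location = sum(empty) / len(empty) / (n - 1) if empty else -1.0
    return coeffs + derivative + [rms, float(len(empty)), location]
```

With the default degree this yields 14 values per variable; adding the exponential and harmonic fits with their derivatives and deviations would bring the vector closer to the roughly 30 entries mentioned above. Such vectors, paired with the transformation that turned out to work, would form the training data for learning the choice of transformation.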
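A domain rule of this kind can be written down outside the modeling process in a few lines. The sketch below assumes one possible encoding of the anisotropy around zero: a sign indicator plus two magnitude variables, with `None` marking values that do not belong to a part. The function name and the encoding are illustrative, not taken from the text.

```python
from typing import List, Optional, Tuple

MaybeFloat = Optional[float]


def split_at_zero(values: List[float]
                  ) -> Tuple[List[int], List[MaybeFloat], List[MaybeFloat]]:
    """Split a signed variable at zero into sign, gains and losses."""
    # Category membership: negative values form a category of their own.
    is_negative = [1 if v < 0 else 0 for v in values]
    # Non-negative amounts, missing where the value is negative.
    gains = [v if v >= 0 else None for v in values]
    # Magnitude of negative amounts, missing where the value is non-negative.
    losses = [-v if v < 0 else None for v in values]
    return is_negative, gains, losses
```

After such a split, -0.01 and 0.01 no longer appear as near neighbours on a single scale, and each part can be prepared on its own.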
The automation of the preparation of data is not an exotic issue. Our brain does it all the time. There is no necessity for an expert data-mining homunculus. Referring to the global scheme of targeted modeling (in the chapter about technical aspects) we now have completed the technical issues for this part. Since we already handled the part of associative storage, “only” two further issues on our track towards machine-based episteme remain: the issue of the emergence of ideas and concepts, and secondly, the glue between all of this.
From a wider perspective we definitely experienced the relativity of data. It is not appropriate to conceive data as “givens”. Quite in contrast, they should be considered as subject to experimental re-combination, as a kind of invitation to transform them.
Data should not be conceived as a result of experiments or measurements, some kind of immutable entities. Such beliefs are directly related to naive realism, to positivism or the tradition of logical empiricism. In contrast, data are the subject or the substrate of experiments of their own kind.
Once the purpose of modeling is given, the automation of modeling thus is possible. Yet, this “purpose” can at first be quite abstract, and usually it is something that results from social processes. It is a salient and an open issue, not only for machine-based episteme, how to create, select or achieve a “purpose.”
Even though it still remains within the primacy of interpretation, it is not clear so far whether targeted modeling can contribute here. We guess, not so much, at least not on its own. What we obviously need is a concept for “ideas”.
[1] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee & Partha Niyogi (2004). General conditions for predictivity in learning theory. Nature 428: 419-422 (25 March 2004).