## Dealing with a Large World

June 10, 2012

The world as an imaginary totality of all actual and virtual relationships between assumed entities can be described in innumerable ways. Even what we call a “characteristic” forms only in a co-dependent manner together with the formation processes of entities and relationships. This fact is particularly disturbing if we encounter something for the first time, without the guidance provided by more or less applicable models, traditions, beliefs or quasi-material constraints. Without those means any selection out of all possible or constructible properties is doomed to be fully contingent, subject to pure randomness.

Yet this does not mean that the results are similarly random. Given a set of tools and methods for the task or situation at hand, modeling is for the most part the task of reducing the infinity of possible selections in such a way that the resulting representation can be expected to be helpful. Of course, this “utility” is not a hard measure in itself. It depends not only on the subjective attitude to risk, mainly the model risk and the prediction risk; utility is also relative to the scale of the scope, in other words, whether one is interested in motor or other purely physical aspects, tactical aspects or strategic aspects, whether one is interested in more local or global aspects, both in time and space, or whether one is interested in any kind of balanced mixture of those aspects. Establishing such a mixture is a modeling task in itself, of course, albeit one that is often accomplished only implicitly.

The randomness mentioned above is a direct corollary of empirical underdetermination¹. From a slightly different perspective, we may also say that it is an inevitable consequence of the primacy of interpretation. We should also not forget that language and particularly metaphors in language (and any kind of analogical thinking as well) are means to deal constructively with that randomness, turning physical randomness into contingency. Even within the penultimate guidance of predictivity (it is only a soft guidance, though), large parts of what we could reasonably conceive as facts (as temporarily fixed arrangements of relations) are mere collaborative construction, an ever undulating play between the individual and the general.

Even if analogical thinking indeed is the cornerstone, if not the Acropolis, of human mindedness, it is always preceded by and always rests upon modeling. Only a model allows us to pick some aspect out of the otherwise unsorted impressions taken up from the “world”. In previous chapters we already discussed quite extensively the various general as well as some technical aspects of modeling, from an abstract as well as from a practical perspective.² Here we focus on a particular challenge, the selection task regarding the basic descriptors used to set up a particular model.

Well, given a particular modeling task we have the practical challenge to reduce a large set of pre-specific properties into a small set of “assignates” that together represent in some useful way the structure of the dynamics of the system that we have observed. How can we reduce a set of several hundred properties created by observation?

The particular challenge arises even in the case of linear systems if we try to avoid subjective “cut-off” points that are buried deep within the method we use. Such heuristic means are widespread in statistically based methods. The bad thing about them is that you can’t control their influence on the results. Since the task comprises the selection of properties for the description of the entities (prototypes) to be formed, such arbitrary thresholds, often justified or even enforced just by the method itself, will exert a profound influence on the semantic level. In other words, the method corroborates its own assumption of neutrality.

Yet we should also never assume linearity of a system, because most of the interesting real systems are non-linear, even in the case of trivial machines. Brute force approaches are not possible, because the number of possible models is 2^n, with n the number of properties or variables. Non-linear models can’t be extrapolated from known ones, of course. The Laplacean demon³ became completely wrapped by Thomean folds⁴, being even quite worried by things like Turing’s formal creativity⁵.

When dealing with observations from “non-linear entities”, we are faced with the necessity to calculate and evaluate any selection of variables explicitly. Assuming a somewhat fantastic figure of 0.0000001 seconds (10^-7 s) needed to calculate a single model, we would still need about 4 × 10^15 years to visit all 2^100 ≈ 1.3 × 10^30 models if we had to deal with just 100 variables. To make it more palpable: it would take almost a million times longer than the age of the Earth, which is roughly 4.5 billion years…
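The arithmetic behind such an estimate can be checked with a few lines. The following is only a back-of-the-envelope sketch: it assumes 10^-7 seconds per model, a year of 365.25 days, and an age of the Earth of about 4.5 billion years; the function and constant names are chosen for illustration.

```python
# Rough cost of exhaustively evaluating every subset of n variables.
SECONDS_PER_MODEL = 1e-7               # optimistic time per model (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_EARTH_YEARS = 4.5e9             # rough value, used only for comparison


def exhaustive_search_years(n_variables, seconds_per_model=SECONDS_PER_MODEL):
    """Years needed to evaluate all 2^n variable subsets one after another."""
    n_models = 2 ** n_variables
    return n_models * seconds_per_model / SECONDS_PER_YEAR


years = exhaustive_search_years(100)
earth_ages = years / AGE_OF_EARTH_YEARS
print(f"{years:.1e} years, i.e. {earth_ages:.1e} times the age of the Earth")
```

Each additional variable doubles the result, so faster hardware only shifts the wall by a handful of variables.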

Obviously, we have to drop the idea that we can “prove” the optimality of a particular model. The only thing we can do is to minimize the probability that within a given time T we can find a better model. On the other hand, the data are not of unbounded complexity, since real systems are not either. There are regularities, islands of stability, so to speak. There is always some structure, otherwise the system would not persist as an observable entity. As a consequence, we can organize the optimization of “failure time probability”; we may even consider this as a second-order optimization. We may briefly note that the actual task thus is not only to select a proper set of variables; we should also identify the relations between the observed and constructed variables. Of course, there are always several if not many sets of variables that we could consider as “proper”, precisely for the reason that they form a network of relations, even if this network is probabilistic in nature and is itself a kind of model.

So, how to organize this optimization? Basically, everything has to be organized as nested, recurrent processes. The overall game we could call learning. Yet, it should be clear that every “move” and every fixation of some parameter and its value is nothing else than a hypothesis. There is no “one-shot-approach”, and no linear progression either.

If we want to avoid naive assumptions (and any assumption that remains untested is de facto a naive assumption), we have to test them. Everything is trial and error, or, expressed in a more educated manner, everything has to be conceived as a hypothesis. Consequently we can reduce the number of variables only by a recurrent mechanism. As a lemma we conclude that any approach that reduces the number of variables in a non-recurrent fashion can’t be conceived as a sound approach.

#### Contingent Collinearities

It is the structuredness of the observed entity that causes the similarity of any two observations across all available or a priori chosen properties. We may also expect that any two variables could be quite “similar”⁶ across all available observations. This provides the first two opportunities for reducing the size of the problem. Note that such reduction by “black-listing” applies only to the first steps in a recurrent process. Once we have evidence that certain variables do not contribute to the predictivity of our model, we may loosen the intensity of any of the reductions! Instead of removing a variable from the space of expressibility, we may preferably arrive at a weighted preference list in later stages of modeling.

So, if we find n observations or variables being sufficiently collinear, we could remove a portion p(n) from this set, or we could compress them by averaging.

R1: reduction by removing or compressing collinear records.

R2: reduction by removing or compressing collinear variables.

A feasible criterion for assessing the collinearity is the monotonicity in the relationship between two variables as it is reflected by Spearman’s correlation. We also could apply K-means clustering using all variables, then averaging all observations that are “sufficiently close” to the center of the clusters.
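As an illustration of R2, the following sketch computes Spearman’s correlation as the Pearson correlation of average ranks and postpones the later variable of every pair whose absolute correlation exceeds a threshold. The threshold of 0.95, the greedy pairwise strategy and all function names are assumptions made for this example; in a recurrent process the threshold itself would be treated as a hypothesis.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


def collinear_blacklist(columns, threshold=0.95):
    """Names of variables to postpone (not to discard for good)."""
    blacklist = set()
    for a, b in combinations(columns, 2):
        if a in blacklist or b in blacklist:
            continue
        if abs(spearman(columns[a], columns[b])) >= threshold:
            blacklist.add(b)
    return blacklist


columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],  # monotone in x1, hence collinear
    "x3": [3.0, 1.0, 6.0, 2.0, 5.0, 4.0],
}
postponed = collinear_blacklist(columns)
print(postponed)
```

Because the criterion is rank-based, `x2` is flagged although its relation to `x1` is quadratic rather than linear.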

Although the respective thresholding is only a preliminary tactical move, we should be aware of the problems we introduce by such a reduction. Firstly, it is the size of the problem that brings in a notion of irreversibility, even if we are fully aware of the preliminary character of the move. Secondly, R1 is indeed critical because it is in some quite obvious way a petitio principii. Even tiny differences in some variables could be masked by larger differences in variables that are eventually recognized as irrelevant. Hence, very tight constraints should be applied when performing R1.

When removing collinear records we also have to take care of the outcome indicator. Often, the focused outcome is much less frequent than its “opposite”. Preferably, we should remove records that are marked as negative outcome, down to a ratio of 1:1 between positive and negative outcome in the reduced data. Such “adaptive” sampling is similar to so-called “biased sampling”.
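A minimal sketch of this outcome-aware variant of R1 is given below. It assumes that some earlier step has already flagged a list of record indices as redundant (collinear); only negative records among them are removed, and removal stops as soon as positives and negatives are balanced. The function name and the data layout are hypothetical.

```python
def prune_redundant_negatives(outcomes, redundant):
    """Return the indices of the records to keep.

    outcomes  -- list of bools, True marks the (rare) positive outcome
    redundant -- indices flagged as collinear by some earlier step
    """
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    removed = set()
    for i in redundant:
        if n_neg <= n_pos:
            break                      # 1:1 ratio reached, stop removing
        if i in removed or outcomes[i]:
            continue                   # never remove positive records
        removed.add(i)
        n_neg -= 1
    return [i for i in range(len(outcomes)) if i not in removed]


outcomes = [True, False, False, False, True, False, False]
kept = prune_redundant_negatives(outcomes, redundant=[1, 2, 4, 5, 6])
print(kept)
```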

#### Directed Collinearities

In addition to those two collinearities there is a third one, which is related to the purpose of the model. Variables that do not contribute to the predictive reconstruction of the outcome we could call “empirically empty”.

R3: reduction by removing empirically empty variables

Modeling without a purpose can’t be considered to be modeling at all⁷, so we always have a target variable available that reflects the operationalization of the focused outcome. We could argue that only those variables are interesting for a detailed inspection that are collinear to the target variable.

Yet, that’s a problematic argument, since we need some kind of model to make the decision whether to exclude a variable or not, based on some collinearity measure. Essentially, that model claims to predict the predictivity of the final model, which of course is not possible. Any such a priori “determination” of the contribution of a variable to the final predictivity of a model is nothing else than a very preliminary guess. Thus, we indeed should treat it just as a guess, i.e. we should consider it as a propensity weight for selecting the variable. In the first explorative steps, however, we could choose an aggressive threshold, causing the removal of many variables from the vector.
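Treating the collinearity with the target as a propensity weight rather than as a verdict could look like the sketch below. The use of the absolute Pearson correlation, the floor value that keeps every variable selectable, and the function names are assumptions for illustration only.

```python
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 if one of the inputs is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def propensity_weights(columns, target, floor=0.05):
    """Normalized selection weights; the floor keeps weak variables in play."""
    raw = {name: max(abs_correlation(col, target), floor)
           for name, col in columns.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


target = [0, 0, 1, 1, 1, 0]
columns = {
    "informative": [0.1, 0.2, 0.9, 0.8, 0.7, 0.3],
    "noise": [0.5, 0.1, 0.4, 0.2, 0.6, 0.3],
}
weights = propensity_weights(columns, target)

# Draw one variable for the next trial model according to its propensity.
rng = random.Random(0)
chosen = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
print(weights, chosen)
```

An aggressive early threshold corresponds to setting the floor to zero and dropping everything below a cut-off; later iterations would raise the floor again.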

#### Splitting

R1 removes redundancy across observations. The same effect can be achieved by a technique called “bagging”, or similarly “foresting”. In both cases a comparatively small portion of the observations is taken to build a “small” model, where the “bag” or “forest” of all small models is then taken to build the final, compound model. Bagging as a technique of “split & reduce” can be applied also in the variable domain.

R4: reduction of complexity by splitting
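The splitting step of R4 can be sketched as follows. The example only draws the record and variable subsets for the small models; how each small model is built and how the models are weighted into a compound model is left open. Note that it subsamples without replacement, whereas classical bagging draws bootstrap samples with replacement; names and sizes are assumptions.

```python
import random


def make_bags(n_records, variables, n_bags, records_per_bag,
              variables_per_bag, seed=0):
    """Draw (record indices, variable names) pairs, one per small model."""
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        rows = sorted(rng.sample(range(n_records), records_per_bag))
        cols = sorted(rng.sample(variables, variables_per_bag))
        bags.append((rows, cols))
    return bags


variables = [f"v{i}" for i in range(20)]
bags = make_bags(n_records=1000, variables=variables, n_bags=10,
                 records_per_bag=100, variables_per_bag=5)
for rows, cols in bags[:2]:
    print(len(rows), cols)
```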

#### Confirming

Once an acceptable model or set of models has been built, we can check the postponed variables one after another. In the case of splitting, the confirmation is implicitly performed by weighting the individual small models.

#### Compression and Redirection

Elsewhere we already discussed the necessity and the benefits of separating the transformation of data from the association of observations. If we separate them, we can see that all we need is an improvement or a preservation of the potential distinguishability of observations. The associative mechanism need not “see” anything that even comes close to the raw data, as long as the resulting association of observations results in a proper derivation of prototypes.⁸

This opens the possibility for a compression of the observations, e.g. by the technique of random projection. Random projection maps a high-dimensional vector space onto a lower-dimensional one by means of a matrix of random numbers. If the dimensionality of the resulting vector of reduced size remains large enough (100+), then the separability of the vectors is kept intact. The reason is that in a high-dimensional vector space almost all vectors are “orthogonal” to each other. In other words, random projection does not substantially change the structure of the relations between vectors.

R5: reduction by compression

During the first explorative steps one could construct a vector space of d=50, which allows a rather efficient exploration without introducing too much noise. Noise in normalized vector space essentially means changing the “direction” of the vectors; the effect of changing the length of vectors due to random projection is much less profound. Note also that introducing noise is not a bad thing at all: it helps to avoid overfitting, resulting in more robust models.
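A minimal sketch of such a compression is shown below. It assumes a dense Gaussian random matrix scaled by 1/sqrt(d), which is one common construction, not necessarily the one used here; the dimensions (1000 inputs, d=100) and all names are chosen for illustration.

```python
import math
import random


def make_projection(n_in, n_out, seed=0):
    """Random matrix (n_out rows, n_in columns) with N(0, 1/n_out) entries."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]


def project(matrix, vector):
    """Map a vector of length n_in to a vector of length n_out."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(1)
a = [rng.random() for _ in range(1000)]
b = [rng.random() for _ in range(1000)]

matrix = make_projection(n_in=1000, n_out=100, seed=42)
a_small, b_small = project(matrix, a), project(matrix, b)

# The ratio of projected to original distance should stay close to 1.
ratio = distance(a_small, b_small) / distance(a, b)
print(len(a_small), round(ratio, 3))
```

The same `matrix` has to be reused for every observation, since distances are only comparable between vectors that went through the same projection.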

If we conceive of this compression by means of random projection as a transformation, we could store the matrix of random numbers as parameters of that transformation. We then could apply it in any subsequent classification task, i.e. when we would apply the model to new observations. Yet the transformation by random projection destroys the semantic link between observed variables and the predictivity of the model. Any of the columns after such a compression contains information from more than one of the input variables. In order to support understanding, we have to reconstruct the semantic link.

That’s fortunately not a difficult task, although it is only possible if we use an index that allows us to identify the observations even after the transformation. The result of building the model is a collection of groups of records, or indices, respectively. Based on these indices we simply identify those variables that minimize the ratio of the variance within the groups to the variance of the means per variable across the groups. This provides us with the weights for the list of all variables, which can be used to drastically reduce the list of input variables for the final steps of modeling.
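The ratio criterion can be sketched as follows, assuming the model has assigned a group label to every record index and the original (untransformed) variables are still available per record. Population variances and the pooling of within-group variances by group size are assumptions of this example, as are the function names.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_ratio(values, groups):
    """Within-group variance divided by the variance of the group means."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    within = sum(variance(vs) * len(vs) for vs in by_group.values()) / len(values)
    between = variance([sum(vs) / len(vs) for vs in by_group.values()])
    return within / between if between > 0 else float("inf")


def rank_variables(columns, groups):
    """Variable names ordered from most to least group-separating."""
    return sorted(columns, key=lambda name: variance_ratio(columns[name], groups))


groups = [0, 0, 0, 1, 1, 1]           # group index per record, from the model
columns = {
    "flat": [1.0, 5.0, 3.0, 1.1, 4.9, 3.1],
    "separating": [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],
}
ranking = rank_variables(columns, groups)
print(ranking)
```

A small ratio marks a variable whose values are tight within the groups and well separated across them; the inverse of the ratio could serve as a selection weight.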

The whole approach could be described as sort of a redirection procedure. We first neglect the linkage between semantics of individual variables and prediction in order to reduce the size of the task, then after having determined the predictivity we restore the neglected link.

This opens the road for an even more radical redirection path. We already mentioned that all we need to preserve through transformation is the distinguishability of the observations without distorting the vectors too much. This could be accomplished not only by random projection, though. If we interpret large vectors as a coherent “event”, we can represent them by the coefficients of wavelets, built from individual observations. The only requirement is that the observations consist of a sufficiently large number of variables, typically n>500.
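As a hedged illustration of the wavelet idea, the sketch below uses the Haar wavelet in its simple averaging form. The text does not name a wavelet family, so Haar is an assumption, as is the compression rule of keeping only the first (coarsest) coefficients; the input length must be a power of two.

```python
def haar_transform(vector):
    """Haar decomposition (averaging form) of a power-of-two length vector."""
    n = len(vector)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [float(v) for v in vector]
    details = []
    while len(data) > 1:
        pairs = [(data[i], data[i + 1]) for i in range(0, len(data), 2)]
        # Prepend so that coarser detail coefficients come first.
        details = [(a - b) / 2 for a, b in pairs] + details
        data = [(a + b) / 2 for a, b in pairs]
    return data + details   # overall mean, then details from coarse to fine


def compress(vector, n_coefficients):
    """Represent an observation by its coarsest Haar coefficients."""
    return haar_transform(vector)[:n_coefficients]


coefficients = haar_transform([4, 2, 6, 8])
print(coefficients)
print(compress([4, 2, 6, 8, 5, 5, 1, 3], 2))
```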

Compression is particularly useful if the properties, i.e. the observed variables, do not bear much semantic value in themselves, as is the case in image analysis, analysis of raw sensory data, or even in the modeling of textual information.

#### Conclusion

In this small essay we described five ways to reduce large sets of variables, or “assignates” as they are called more appropriately. Since for pragmatic reasons a petitio principii can’t be avoided in attempting such a reduction, mainly due to the inevitable fact that we need a method for it, the reduction should be organized as a process that decreases the uncertainty in assigning a selection probability to the variables.

Regardless of the kind of mechanism used to associate observations into groups, thereby forming the prototypes, a separation of transformation and association is mandatory for such a recurrent organization to be possible.

1. Quine [1]

2. see: the abstract model, modeling and category theory, technical aspects of modeling, transforming data;

3. The “Laplacean Demon” refers to Laplace’s belief that if all parts of the universe could be measured the future development of the universe could be calculated. As such it is the paradigmatic label for determinism. Today we know that even IF we could measure everything in the universe with arbitrary precision (which we could not, of course), we still could NOT pre-calculate the further development of the universe. The universe does not develop, it performs an open evolution.

4. Rene Thom [2] was the first to explicate the mathematical theory of folds in parameter space, which was dubbed “catastrophe theory” in order to reflect the subject’s experience when moving around in folded parameter spaces.

5. Alan Turing not only laid the foundations of deterministic machines for performing calculations; he was also the first to derive the formal structure of self-organization [3]. Based on these formal insights we can design the degree of creativity of a system.

impossibility to know for sure is the first and basic reason for culture.

6. Note that determining similarity also requires a priori decisions about methods and scales, which need to be confirmed. In other words we always have to start with a belief.

7. Modeling without a purpose can’t be considered to be modeling at all. Performing a clusterization by means of some algorithm is not creating a model until we use it, e.g. in order to get some impression. Yet, as soon as we indeed take a look following some goal we imply a purpose. Unfortunately, in this case we would be enslaved by the hidden parameters built into the method. Things like unsupervised modeling, or “just clustering”, always imply hidden targets and implicit optimization criteria, determined by the method itself. Hence, such things can’t be regarded as a reasonable move in data analysis.

8. This sheds an interesting light on the issue of “representation”, which we could not follow here.

- [1] W. V. O. Quine. Two Dogmas of Empiricism.
- [2] Rene Thom. Catastrophe Theory.
- [3] Alan Turing (1952). The Chemical Basis of Morphogenesis.

۞

## Similarity

December 30, 2011

Similarity appears to be a notoriously inflationary concept.

Already in 1979 a presumably even incomplete catalog of similarity measures in information retrieval listed almost 70 ways to determine similarity [1]. In contemporary philosophy, however, it is almost absent as a concept, probably because it is considered merely as a minor technical aspect of empirical activities. Often it is also related to naive realism, which claimed a similarity between a physical reality and concepts. Similarity is also a central topic in cognitive psychology, yet not often discussed, probably for the same reasons as in philosophy.

In both disciplines, understanding is usually equated with drawing conclusions. Since the business of drawing conclusions and describing the kinds and surrounds of that is considered to be the subject of logic (as a discipline), it is comprehensible that logic has been rated by many practitioners and theoreticians alike as the master discipline. While there has been a vivid discourse about logical aspects for many centuries now, the role of similarity is largely neglected, and where vagueness makes its way to the surface, it is “analyzed” completely within logic. Also not quite surprisingly, artificial intelligence focused strongly on a direct link towards propositional logic and predicate calculus for a long period of time. This link has been represented by the programming language “Prolog,” an abbreviation standing for “programming in logic.” It was established in the first half of the 1970s by groups in Marseille and Edinburgh (the so-called Edinburgh school). Let us just note that this branch of machine learning disastrously failed. Quite remarkably, the generally attested reason for this failure has been called the “knowledge acquisition bottleneck” by the community suffering from it. Somehow the logical approach was completely unsuitable for getting in touch with the world, which actually is not really surprising for anyone who understood Wittgenstein’s philosophical work, even if only partially. Today, the logic-oriented approach is generally avoided in machine learning.

As a technical aspect, similarity is abundant in the so-called field of data mining. Yet, there it is not discussed as a subject in its own right. In this field, as represented by the respective software tools, rather primitive notions of similarity are employed, importing a lot of questionable assumptions. We will discuss them a bit later.

There is a particular problem with the concept of similarity, one that endangers many other abstract terms, too. This problem appears if the concept is equated with its operationalization. Sciences and engineering are particularly prone to losing sight of this distinction. It is an inevitable consequence of the self-conception of science, particularly the hypothetico-deductive approach [cf. 2], to assign ontological weight to concepts. Nevertheless, such assignment always commits a naturalization fallacy. Additionally, we may suggest that ontology itself is a deep consequence of an overly scientific, say: positivistic, mechanic, etc. world view. Dropping the positivistic stance removes ontology as a relevant attitude.

As a consequence, science is not able to reflect about the concept itself. What science can do, regardless of the discipline, is just to propose further variants as representatives of a hypothesis, or to classify the various proposed approaches. This poses a serious secondary methodological problem, since it equally holds that there is no science without the transparent usage of the concept of similarity. Science should control free parameters of experiments and their valuation. Somewhat surprisingly, almost the “opposite” can be observed. The fault is introduced by statistics, as we will see, and this result really came as a surprise even for me.

A special case is provided by “analytical” linguistics, where we can observe a serious case of reduction. In [3], the author selects the title “Vagueness and Linguistics,” but also admits that “In this paper I focused my discussion on relative adjectives.” Well, vagueness can hardly be restricted to anything like relative adjectives, even in linguistics. Even more astonishing is the fact that similarity does not appear as a subject at all in the cited article (except in a reference to another author).

In the field engaged in the theory of the metaphor [cf. 4, or 5], one can find a lot of references to similarity. In every case known to me it is, however, regarded as something “elementary” and unproblematic. Obviously neither extra-linguistic modeling nor any kind of inner structure of similarity is recognized as important or even as possible. No particular transparent discourse about similarity and modeling is available from this field.

From these observations it is possible in principle to derive two different, and mutually exclusive conclusions. First, we could conclude that similarity is irrelevant for understanding phenomena like language understanding or the empirical constitution. We don’t believe in that. Second, it could be that similarity represents a blind spot across several communities. Therefore we will try to provide a brief overview about some basic topics regarding the concept of similarity.

#### Etymology

Let us add some etymological considerations for a first impression. Words like “similar,” “simulation” or “same” all derive from the Proto-Indo-European (PIE) base “*sem-/*som-”, which meant “together, one”, in Old English then “same.” Yet, there is also the notion of “simulacrum” in the “same cloud”; the simulacrum is a central issue in the earliest pieces of philosophy that we know in sufficient detail (Plato).

The German word “ähnlich,” being the direct translation of “similar,” derives from Old High German (Althochdeutsch, well before ~1050 AD) “anagilith” [6], a composite of an- and gilith, together meaning something like “angleichen,” for which in English we find the words adapt, align, adjust or approximate, but also “conform to” or “blend.” The similarity to “sema” (“sign”) seems to be only superficial; it is believed that sema derives from PIE “dhya” [7].

If some items are said to be “similar,” it is meant that they are not “identical,” where identical means indistinguishable. To make them (virtually) indistinguishable, they would have to be transformed. Even from etymology we can see that similarity needs an activity before it can be attested or assigned. Similarity is nothing to be found “there,” instead it is something that one is going to produce in a purposeful manner. This constructivist aspect is quite important for our following considerations.

#### Common Usage of Similarity

In this section, we will inspect the usage of the concept of “similarity” in some areas of particular relevance. We will visit cognitive psychology, information theory, data mining and statistical modeling.

##### Cognitive Psychology

Let us start with the terminology that has been developed in cognitive psychology, where one can find a rich differentiation of the concept of similarity. It started with the work of Tversky [8], while Goldstone has more recently provided a useful overview [9].

Tversky, a highly innovative researcher on cognition, tried to generalize the concept of similarity. His intention is to overcome the typical weakness of “geometric models”, which “[…] represent objects as points in some coordinate space such that the observed dissimilarities between objects correspond to the metric distances between the respective points.”¹ The major assumption (and drawback) of geometric models is the (metric) representability in coordinate space. A typical representative of “geometric models”, as Tversky calls them, employs the nowadays widespread Euclidean distance as an operationalization for similarity.

> A new set-theoretical approach to similarity is developed in which objects are represented as collections of features, and similarity is described as a feature-matching process. Specifically, a set of qualitative assumptions is shown to imply the contrast model, which expresses the similarity between objects as a linear combination of the measures of their common and distinctive features.

Tversky’s critique of the “geometrical approach” applies only if two restrictions are active: (1) if missing values are disregarded, which actually is the case in most practices; (2) if the dimensional interpretation is considered to be stable and unchangeable, i.e. no folding or warping of the data space via transformation of the measured data is applied.

Yet, it is neither necessary to disregard missing values in a feature-based approach nor to dismiss dimensional warping. Here Tversky does not differentiate between form of representation and the actual rule for establishing the similarity relation. This conflation is quite abundant in many statements about similarity and its operationalization.

What Tversky effectively has been proposing is now known as binning. His approach is based on features, though in a way quite different (at first sight) from our proposal, as we will show below. Yet it is not the values of the features that are compared, but the two sets on the level of the items, by means of a particular ratio function. From a different perspective, the data scale used for assessing similarity is reduced to the ordinal or even the nominal scale. Tversky’s approach thus is prone to destroy information present in the “raw” signal.

An attribute (Tversky’s “feature”) that occurs in different grades or shades is translated into a small set of different, distinct and mutually exclusive features. Tversky obviously does not recognize that binning is just one out of many, many possible ways to deal with observational data, i.e. to transform it. Applying a particular transformation based on some theory in a top-down manner is equivalent to the claim that the selected transformation builds a perfect filter for the actually given data. Of course, this claim is deeply inadequate (see the chapter about technical aspects of modeling). Any untested, axiomatically imposed algorithmic filter may destroy just those pieces of information that would have been vital to achieve a satisfying model. One simply *can’t* know before.

Tversky’s approach builds on feature sets. The difference between those sets (on a nominal level) should represent the similarity and is expressed by the following formula:

s(a,b) = F(A ∩ B, A-B, B-A)    (eq.1)

which Tversky describes in the following way:

> The similarity of a to b is expressed as a function F of three arguments: A ∩ B, the features that are common to both a and b; A-B, the features that belong to a but not to b; B-A, the features that belong to b but not to a.

This formula also reflects what he calls “contrast.” (It is similar to Jaccard’s distance; so to speak, it is an extended practical version of it.) Yet, Tversky, like any other member of the community of cognitive psychologists referring to this or a similar formula, did not recognize that the features, when treated in this way, are all equally weighted. It is a consequence of sticking to set theory. Again, this is just the fallback position of initial ignorance of the investigator. In the real world, however, features are differentially weighted, building a context. In the chapter about the formalization of the concept of context we propose a more adequate possibility to think about feature sets, though our concept of context shares important aspects with Tversky’s approach.

Tversky emphasizes that his concept does not consist of just one single instance or formula. He introduces weighting factors for the terms of eq.1, which then leads to families of similarity functions. To our knowledge this is the only instance (besides ours) arguing for a manifold regarding similarity. Yet, again, Tversky still does not draw the conclusion that the chosen instance of a similarity “functional” (see below) has to be conceived as just a hypothesis.
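The weighted family can be made concrete with Tversky’s contrast model, S(a,b) = θ·f(A ∩ B) − α·f(A − B) − β·f(B − A), and his ratio model, f(A ∩ B) / (f(A ∩ B) + α·f(A − B) + β·f(B − A)). The sketch below is a minimal illustration; the default parameter values, the use of plain set cardinality for f and the example features are assumptions. Passing a different `f` would allow features to be weighted differentially.

```python
def contrast_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5, f=len):
    """Tversky's contrast model: common minus weighted distinctive features."""
    a, b = set(a), set(b)
    return theta * f(a & b) - alpha * f(a - b) - beta * f(b - a)


def ratio_similarity(a, b, alpha=1.0, beta=1.0, f=len):
    """Tversky's ratio model; alpha = beta = 1 yields the Jaccard index."""
    a, b = set(a), set(b)
    common = f(a & b)
    denominator = common + alpha * f(a - b) + beta * f(b - a)
    return common / denominator if denominator else 1.0


robin = {"wings", "feathers", "beak", "flies"}
penguin = {"wings", "feathers", "beak", "swims"}

print(contrast_similarity(robin, penguin))                 # 3 - 0.5 - 0.5
print(ratio_similarity(robin, penguin))                    # Jaccard: 3 / 5
print(ratio_similarity(robin, penguin, alpha=1, beta=0))   # asymmetric variant
```

Every choice of θ, α, β and f selects one member of the family, which is why any particular choice is best regarded as a hypothesis to be tested.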

In cognitive psychology (even today), the term “feature-based models” of similarity does *not* refer to feature vectors as they are used in data mining, or even generalized vectors of assignates, as we proposed it in our concept of the generalized model. In Tversky’s article this becomes manifest on p.330. Contemporary psychologists like Goldstone [9] distinguish four different ways of operationalizing similarity: (1) geometric, (2) feature-based, (3) alignment-based, and (4) transformational similarity. For Tversky [8] and Goldstone, the label “geometric model” refers to models based on feature vectors, as they are used in data mining, e.g. as Euclidean distance.

Our impression is that cognitive psychologists fail to think in an abstract enough manner about features and similarity. Additionally, it seems that there is a tendency towards the representationalist fallacy. Features are only recognized as features as far as they appear “attached” to the object for human senses. Dropping this attitude, it becomes an easy exercise to subsume all those four types in a feature-vector approach that (1) allows for missing values and assigns them a “cost”, and which (2) is not limited to primitive distance functions like Euclidean or Hamming distance. The underdeveloped generality is especially visible concerning the alignment or transformational subtype of similarity.
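How missing values can carry a “cost” inside a feature-vector comparison is sketched below. This is only one conceivable operationalization, not the generalized model itself: missing values are encoded as `None`, each missing dimension contributes a fixed penalty, and optional per-dimension weights are supported; all names and defaults are hypothetical.

```python
import math


def distance_with_missing(u, v, missing_cost=1.0, weights=None):
    """Weighted Euclidean-style distance in which None marks a missing value."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    weights = weights or [1.0] * len(u)
    total = 0.0
    for a, b, w in zip(u, v, weights):
        if a is None or b is None:
            total += w * missing_cost   # fixed penalty instead of dropping the dimension
        else:
            total += w * (a - b) ** 2
    return math.sqrt(total)


print(distance_with_missing([0.0, 1.0], [3.0, 5.0]))
print(distance_with_missing([1.0, None], [1.0, 2.0], missing_cost=0.25))
```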

A further gap in the similarity theory in cognitive psychology is the missing separation between the operation of comparison and the operationalization of similarity as a projective transformation into a 0-dimensional space, that is a scalar (a single value). This distinction is vital, in our opinion, to understand the characteristics of comparison. If one does not separate similarity from comparison, it becomes impossible to become aware of higher forms of comparison.

##### Information theory

A much more profound generalization of similarity, at least at first sight, has been proposed by Dekang Lin [10], which is based on an “information-theoretic definition of similarity that is applicable as long as there is a probabilistic model.” The main advantage of this approach is its wide applicability, even in cases where only coarse frequency data are available. Quite unfortunately, Lin’s proposal neglects a lot of information if there are accurate measurements in the form of feature-vectors. Besides the comparison of strings and statistical frequency distributions, Lin’s approach is applicable to sets of features, but not to profile-based data, as we propose for our generalized model.
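For feature sets, Lin’s measure is commonly written as sim(A,B) = 2·I(F(A) ∩ F(B)) / (I(F(A)) + I(F(B))), where I(S) is the summed information content −log P(f) of the features in S. The sketch below illustrates this form; the feature probabilities are made up for the example and would in practice have to be estimated from frequency data.

```python
import math


def information(features, probability):
    """Summed information content -log P(f) of a set of features."""
    return -sum(math.log(probability[f]) for f in features)


def lin_similarity(a, b, probability):
    """Shared information relative to the total information of both sets."""
    a, b = set(a), set(b)
    total = information(a, probability) + information(b, probability)
    return 2 * information(a & b, probability) / total if total else 1.0


# Hypothetical feature probabilities: rare features carry more information.
probability = {"wings": 0.5, "feathers": 0.4, "flies": 0.3, "swims": 0.05}

robin = {"wings", "feathers", "flies"}
penguin = {"wings", "feathers", "swims"}
similarity = lin_similarity(robin, penguin, probability)
print(round(similarity, 3))
```

Unlike the equally weighted set measures, a shared rare feature raises the similarity more than a shared common one.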

##### Data Mining

Data mining is a distinguished set of tools and methods that are employed in a well-organized manner in order to facilitate the extraction of relevant patterns [11], either for predictive or for diagnostic purposes. Data Mining (DM) is often conceived as a part of so-called “knowledge discovery,” building the famous abbreviation KDD: knowledge discovery in databases [11]. In our opinion, the term “data mining” is highly misleading, and “knowledge discovery” even deceptive. In contrast to earthly mining, in the case of information the valuable objects are not “out there” like minerals or gems, while knowledge can’t be “discovered” like diamonds or physical laws. Even the “retrieval” of information is impossible in principle. To think otherwise dismisses the need for interpretation and hence contradicts widely acknowledged positions in contemporary epistemology. One has to know that the terms “KDD” and “data mining” are shallow marketing terms, coined to release the dollars of naive customers. Yet, KDD and DM are myths many people believe in and which are reproduced in countless publications. As concepts, they simply remain utter nonsense. As a nonsensical practice that is deeply informed by positivism, it is harmful for society. It is more appropriate, and more down-to-earth, to call the respective activity just *diagnostic* or *predictive modeling* (which actually is equivalent).

Any observation of entities takes place along apriori selected properties, often physical ones. This selection of properties is part of the process of creating an operationalization, which actually means to make a concept operable through making it measurable. Actually, those properties are not “natural properties of objects.” Quite the contrary, objecthood is created by the assignment of a set of features. This inversion is often overlooked in data mining projects, and consequently also the eminently constructive characteristics of data-based modeling. Hence, it is likewise not correct to call it “data analysis”: an analysis does not add anything. Predictive/diagnostic models are constructed and synthesized like small machines. Models may well be conceived as an informational machinery. To make our point clear: nobody among the large community of machine-building engineers would support the view that any machine comes into existence just through “analysis.”

Given the importance of similarity in comparison, it is striking to see that in many books about data mining the notion of “similarity” does not appear even a single time [e.g. 12], and in many more publications only in a very superficial manner. Usually, it is believed that the Euclidean distance is a sound, sufficient and appropriate operationalization of similarity. Given its abundance, we have to take a closer look at this concept, how it works, and how it fails.

##### Euclidean Distance and its Failure

We already met the idea that objects are represented along a set of selected features. In the chapter about comparison we saw that in order to compare items of a population of objects, those objects are to be compared on the basis of a selected and shared feature set. Next, it is clear that for each of the features some values can be measured. For instance, presence could be indicated by the dual pair of values 1/0. For nominal values like names, re-scaling mechanisms have been proposed [13]. Thus, any observation can be transformed into a table of values, where the (horizontal) rows represent the objects and the columns describe the features.

We also can say that any of the objects contained in such a table is represented by a profile. Note that the order of the columns (features) is arbitrary, but it is also constant for all of the objects covered by the table.

The idea now is that each of the columns represents a dimension in a Cartesian, orthogonal coordinate system. As a preparatory step, we normalize the data, i.e. for each single column the values contained in it are scaled such that the ratios remain unchanged, but the absolute values are projected into the interval [0..1].
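The normalization step can be sketched as a column-wise linear (min-max) rescaling, which keeps the ratios of differences within a column intact. The function name `normalize_columns` and the handling of constant columns are our assumptions; other scaling schemes are equally possible.

```python
def normalize_columns(table):
    """Linearly rescale every column of a row-major table into [0, 1]."""
    columns = list(zip(*table))
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column offers no contrast; we map it to 0.0 (assumption).
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    # Transpose back so that rows are objects and columns are features.
    return [list(row) for row in zip(*scaled)]


table = [[2.0, 10.0],
         [4.0, 30.0],
         [6.0, 20.0]]
normalized = normalize_columns(table)
```

Each row of `normalized` is then a profile whose values all lie in [0..1].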

By means of such a representation any of the objects (=data rows) can be conceived as a particular point in the space spanned by the coordinate system. The similarity S is then operationalized as the “inverse” of the distance, S=1-d, between any two of the points. The distance can be calculated according to the Euclidean formula for the length of the hypotenuse in a right triangle (2d case). In this way, each point is understood as the endpoint of a vector that starts in the origin of the coordinate system. Thus, this space is often called “data space” or “vector space.” The distance is called “Euclidean distance.”
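As a sketch, the operationalization S=1-d looks as follows. One detail is our addition and not part of the description above: for k normalized dimensions the distance can grow up to the square root of k, so we divide by that value to keep S within [0..1].

```python
from math import sqrt


def euclidean_distance(a, b):
    """Length of the straight line between two points in the data space."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_similarity(a, b):
    """S = 1 - d, with d divided by sqrt(k) so that S stays in [0, 1]."""
    k = len(a)
    return 1.0 - euclidean_distance(a, b) / sqrt(k)


s_same = euclidean_similarity([0.2, 0.7], [0.2, 0.7])      # identical profiles
s_far = euclidean_similarity([0.0, 0.0], [1.0, 1.0])       # opposite corners
```

Every feature enters the sum separately and with equal weight, which is exactly where the independence assumption discussed below comes in.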

Since all of the vectors lie within the unit hypercube (any value is in [0..1]), there is another possibility for an operationalization of the similarity. Instead of the distance one could take the angle between any two of those vectors. This yields the so-called cosine measure of (dis-)similarity.
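For comparison, a minimal sketch of the cosine measure; the function name and the decision to reject all-zero profiles are our assumptions.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors starting at the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the angle is undefined for an all-zero profile")
    return dot / (norm_a * norm_b)


c_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
c_general = cosine_similarity([0.9, 0.1], [0.5, 0.5])
```

The measure depends on the direction of the vectors only, not on their length.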

Besides the fact that missing values are often (and wrongly) excluded from a feature-vector-based comparison, this whole procedure has a serious built-in flaw, whether as cosine- or as Euclidean distance.

The figure 1a below shows the profiles of two objects above a set of assignates (aka attributes, features, properties, fields). The embedding coordinate space has k dimensions. One can see that the black profile (representing object/case A) and the red profile (representing object/case B) are fairly similar. Note that within the method of the Euclidean distance all ai are supposed to be independent from each other.

**Figure 1a:** Two objects A’ and B’ have been represented as profiles A, B across a shared feature vector ai of size k.

Next, we introduce a third profile, representing object C. Suppose that the correlation between profiles A and C is almost perfect. This means that the inner structure of objects A and C could be considered to be very similar. Some additional factor might just have damped the signal, such that all values are proportionally lower by an almost constant ratio when compared to values measured from object A.

**Figure 1b:** Compared to figure 1a, a third object C’ is introduced as a profile C; this profile causes a conflict about the order that should be induced by the similarity measure. There are (very) good reasons, from systems theory as well as from information theory, to consider A and C more similar to each other than either A-B or B-C. Nevertheless, employing Euclidean distance will lead to a different result, rating the pairing A-B as the most similar one.

The particular difficulty is that which of the pairs of observations is considered most similar depends on considerations that lie completely outside of the chosen operationalization of similarity. Yet, this dependency inverts the intended arrangement of the empirical setup: the determination of similarity should actually be used to decide about those outside considerations. Given the Euclidean distance, A and B are clearly much more similar to each other than either A-C or B-C. Using a correlative measure instead would select A-C as the most similar pairing. This effect gets more and more serious the more assignates are used to compare the items.
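The conflict can be reproduced with a few invented numbers. In the sketch below the three profiles are hypothetical; C is simply A damped by a constant factor of 0.5 (a perfect instead of an almost perfect correlation), and B is A with some added irregularity. Pearson's coefficient stands in for "a correlative measure."

```python
from itertools import combinations
from math import sqrt


def euclidean_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation of two profiles (undefined for constant profiles)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical profiles over k = 6 assignates.
profiles = {
    "A": [0.9, 0.3, 0.8, 0.2, 0.7, 0.4],
    "B": [0.8, 0.45, 0.7, 0.35, 0.8, 0.3],
    "C": [0.45, 0.15, 0.4, 0.1, 0.35, 0.2],   # A damped by the factor 0.5
}
pairs = list(combinations(sorted(profiles), 2))

closest_by_distance = min(
    pairs, key=lambda p: euclidean_distance(profiles[p[0]], profiles[p[1]]))
closest_by_correlation = max(
    pairs, key=lambda p: pearson(profiles[p[0]], profiles[p[1]]))
```

With these numbers the Euclidean distance selects the pairing A-B, while the correlation selects A-C, although the data are the same in both cases.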

Now imagine that there are many observations, dozens, hundreds or hundreds of thousands, that serve as a basis for deriving an intensional description of all observations. It is quite obvious that the final conclusions will differ drastically upon the selection of the similarity measure. The choice of the similarity measure is by no means only of technical interest. The particular problematics, but also, as we will see, the particular opportunity that is related to the operationalization of similarity consists in the fact that there is a quite short and rather strong link between a technical aspect and the semantic effect.

Yet, there are measures that reflect the similarity of the form of the whole set of items more appropriately, such as least-square distances, or measures based on correlation, like the Mahalanobis distance. However, these analytic measures have the disadvantage of relying on certain global parametric assumptions, such as normal distribution. And they do not completely resolve the situation shown in figure 1b, even in theory.

We just mentioned that the coherence of value items may be regarded as a form. Thus, it is quite natural to use a similarity measure that is derived from geometry or topology, which also does not suffer from any particular analytic apriori assumption. One such measure is the Hausdorff metric, or more generally the Gromov-Hausdorff metric. Having been developed in geometry, they find their “natural” application in image analysis, such as partial matching of patterns to larger images (aka finding “objects” in images). For the comparison of profiles we have to interpret them as figures in a 2-dimensional space, with |ai|/2 coordinate points. Two such figures are then prepared to be compared. The Hausdorff distance is also quite interesting because it allows us to compare whole sets of observations, not only two paired observations (profiles) interpreted as coordinates in ℝ², but also three observations as ℝ³, or a whole set of n observations, arranged as a table, as a point cloud in ℝⁿ. Assuming compactness, i.e. a macroscopic world without gaps, we may interpret them also as curves. This allows us to compare whole sub-sets of observations *at once*, which is a quite attractive feature for the analysis of relational data. As far as we know, nobody ever used the Hausdorff metric in this way.
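A brute-force sketch of the Hausdorff distance between two finite point sets is given below. The helper `profile_to_points`, which pairs consecutive values of a profile into k/2 planar points, is only one possible reading of the interpretation described above and should be taken as our assumption.

```python
from math import dist  # available since Python 3.8


def directed_hausdorff(P, Q):
    """Largest distance from a point in P to its nearest neighbor in Q."""
    return max(min(dist(p, q) for q in Q) for p in P)


def hausdorff(P, Q):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(P, Q), directed_hausdorff(Q, P))


def profile_to_points(profile):
    """Pair consecutive values into (x, y) points: k values give k/2 points."""
    if len(profile) % 2:
        raise ValueError("an even number of values is required")
    return [(profile[i], profile[i + 1]) for i in range(0, len(profile), 2)]


figure_a = profile_to_points([0.9, 0.3, 0.8, 0.2, 0.7, 0.4])
figure_b = profile_to_points([0.8, 0.45, 0.7, 0.35, 0.8, 0.3])
h = hausdorff(figure_a, figure_b)
```

Since the two point sets need not have the same size, the same functions apply to whole groups of observations treated as point clouds.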

Epistemologically, it is interesting that a topologically inspired assessment of data provides a formal link between feature-based observations and image processing. Maybe, this is relevant for the subjective impression to think in “images,” though nobody has ever been able to “draw” such an image… This way, the idea of “form” in thought could acquire a significant meaning.

Yet, already in his article published more than 30 years ago, Tversky [8] mentioned that the metric approach is barely convincing. He writes (p.329)

The applicability of the dimensional assumption is limited, […] minimality is somewhat problematic, symmetry is apparently false, and the triangle inequality is hardly compelling.

It is of utmost importance to understand that the selection of the similarity measure as well as the selection of the features used to calculate it are by far the most important factors in the determination, or better hypothetical presupposition, of the similarity between the profiles (objects) to be compared. The similarity measure and the feature selection are far more important than the selection of a particular method, i.e. a particular way of organizing the application of the similarity measure. Saying “more important” also means that the differences in the results are much larger between different similarity measures than between methods. From a methodological point of view it is thus quite important that the similarity measure is “accessible” and not buried in a “heap of formulas.”

Similarity measures that are based only on dimensional interpretation and coordinate spaces are not able to represent issues of form and differential relations, which is also (and better) known as “correlation.” Of course, other approaches different from correlation that would reflect the form aspect of the internal relations of a set of variables (features) would do the job, too. We just want to emphasize that the assumption of perfect independence among the variables is “silly” in the sense that it contradicts the “game” that the modeler actually pretends to play. This leads more often than not to irrelevant results. The serious aspect about this, however, is that this deficiency remains invisible when comparing results between different models built according to the Euclidean dogma.

There is only one single feasible conclusion from this: Similarity can’t be regarded as property of actual pairings of objects. The similarity measure is a free parameter in modeling, that is, nothing else than a hypothesis, though on the structural level. As a hypothesis, however, it needs to be tested for adequacy.

##### Similarity in Statistical Modeling

In statistical modeling the situation is even worse. Usually, the subject of statistical modeling is not the individual object or its representation. The reasoning in statistical modeling differs strongly from the reasoning in predictive modeling. Statistics compares populations, or at least groups as estimates of populations. Dependent on the scale of the data, the amount of data, the characteristics of data and the intended argument, a specialized method has to be picked from a large variety of potential methods. Often, the selected method also has to be parameterized. As a result, the whole process of creating a statistical model is more “art” than science. Results of statistical “analysis” are only approximately reproducible across analysts. It is indeed somewhat ironic that at the heart of quantitative science one finds a non-scientific methodological core.

Anyway, our concern is similarity. In statistical modeling there is no similarity function visible at all. All that one can see is the result and proposals like “population B is not part of population A with a probability for being false positive of 3%.” Yet, the final argument that populations can be discerned (or can’t) is obviously also an argument about the probability for a correct assignment of the members of the compared populations. Hence, it is also clearly an argument about the group-wise as well as the individual similarity of the objects. The really bad thing is the similarity function is barely visible at all. Often it is some kind of simple difference between values. The main point is that it is not possible to parametrize the hidden similarity function, except by choosing the alpha level for the test. It is “distributed” across the whole procedure of the respective method. In its most important aspect, any of the statistical methods has to be regarded as a black box.

These problems with statistical modeling prevail across the general framework, i.e. regardless of whether one chooses a frequentist or a Bayesian attitude. Recently, Alan Hajek [14] proved that statistics is a framework that in all its flavors suffers from the reference class problem. Cheng [15] correctly notes about the reference class problem that

“At its core, it observes that statistical inferences depend critically on how people, events, or things are classified. As there is (purportedly) no principle for privileging certain categories over others, statistics become manipulable, undermining the very objectivity and certainty that make statistical evidence valuable and attractive …”

So we can see that the reference class problem is just a corollary of the fact that the similarity function is not given explicitly and hence also is not accessible. Thus, Cheng seeks unfulfillable salvation by invoking the cause of defect itself: statistics. He writes

I propose a practical solution to the reference class problem by drawing on model selection theory from the statistics literature.

Although he is right in pointing to the necessity of model selection, he fails to recognize that statistics can’t be helpful in this task. We find it interesting that this author (Cheng) has been writing for the community of law theoreticians. This sheds bright light onto the relevance of an appropriate theory of modeling.

As a consequence we conclude that statistical methods should not be used as the main tool for any diagnostic/predictive modeling of real-world data. The role of statistical methods in predictive/diagnostic modeling is just the same as that of any other transformation: they are biased filters, whose adequacy has to be tested, nothing less, and, above all, definitely nothing more. Statistics should be used only within completely controllable, hence completely closed environments, such as simulations, or “data experiments.”

#### The Generalized View

Before we start, we would like to recall the almost trivial aspect that the concept of similarity makes sense exclusively in the context of diagnostic/predictive modeling, where “modeling” refers to the generalized model, which in turn is part of a transcendental structure.

After having briefly discussed the relation of the concept of similarity to some major domains of research, we now may turn to the construction/description of a proper concept of similarity. The generalized view that we are going to argue for should help determining the appropriate mode of speaking about similarity.

##### Identity

Identity is often seen as the counterpart of similarity, or also as some kind of a simple asymptotical limit to it. Yet, these two concepts are so deeply incommensurable that they cannot be related at all.

One could suggest that identity is a relation that indicates a particular result of a comparison, namely indistinguishability. We then also could say that under any possible transformation applied to identical items the respective items remain indistinguishable. Yet, if we compare two items we refer to the concept of similarity, from which we want to distinguish it. Thus it is clear that identity and similarity are structurally different. There is no way from one to the other.

In other words, the language game of identity excludes any possibility for a comparison. We can set it only by definition, axiomatically. This means not only that the concepts can’t be related to each other; additionally, we see that the subjects of the two concepts are categorically different. Identity is only meaningful as an axiomatically introduced equality of symbols.

In still other words we could say that identity is restricted to axiomatically defined symbols in formal surrounds, while similarity is applicable only in empirical contexts. Similarity is not about symbols, but about measurement and the objects constructed from it.

This has profound consequences.

First, identity can’t be regarded as a kind of limit to which similarity would asymptotically approximate. For any two objects that have been rated as being “equal,” notably through some sort of comparison, it is thus possible to find a perspective under which they are not equal any more.

Second, it is impossible to take an existential stance towards similarity. Similarity is the result of an action, of a method or technique that is embedded in a community. Hence it is not possible to assign similarity an ontic dimension. Similarity is not part of any possible ontology.

We can’t ask “What is similarity?”, we also can not even pretend to determine “the” similarity of two subjects. “Similarity” is a very particular language game, much like its close relatives like vagueness. We only can ask “How to speak about similarity?”

Third, it is now clear that there is no easy way from a probabilistic description to a propositional reference. We already introduced this in another chapter, and we will deal with it in detail elsewhere. There is no such transition within a single methodology. We just see again how far Wittgenstein’s conclusion about the relation of the world and logic is reaching. The categorical separation between identity and similarity, or between the empirical and the logical, can’t be overestimated. For our endeavor of a machine-based epistemology it is of vital interest to find a sound theory for this transition, which in any of the relevant research areas has not even been recognized as a problematic subject so far.

##### Practical Aspects

Above we have seen that any particular similarity measure should be conceived as part of a general hypothesis about the best way to create an optimized model. Within such a hypothesizing setting we can distinguish two domains embedded into a general notion of similarity. We could align these two modes to the distinction Peirce introduced with regard to uncertainty: probability and verisimilitude [16]. Thus, the first domain regards the variation along assignates that are shared among two items. From the perspective of any of the compared items, there is complete information about the extension of the world of that item. Any matter is a matter of degree and probability, as Peirce understood it. Taking the perspective of Deleuze, we could call it also possibility.

The second domain is quite different. It is concerned with the difference of the structure of the world as it is accessible for each of the compared items, where this difference is represented by a partial non-matching of the assignates that provide the space for measurement. Here we meet Tversky’s differential ratio that builds upon differences in the set of assignates (“features,” as he called it) and can be used also to express differential commonality.
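A small sketch of a ratio of this kind, in the form commonly given for Tversky's ratio model, is shown below. Counting features as the measure of salience and the particular weights `alpha` and `beta` are our assumptions; the feature names are invented.

```python
def tversky_ratio(a, b, alpha=1.0, beta=1.0):
    """Common features relative to common plus weighted distinctive features."""
    a, b = set(a), set(b)
    common = len(a & b)
    only_a = len(a - b)      # assignates available for item a only
    only_b = len(b - a)      # assignates available for item b only
    denominator = common + alpha * only_a + beta * only_b
    if denominator == 0:
        raise ValueError("undefined for two empty feature sets")
    return common / denominator


balanced = tversky_ratio({"x", "y", "z"}, {"y", "z", "w"})
forward = tversky_ratio({"x", "y"}, {"x", "y", "z"}, alpha=1.0, beta=0.0)
backward = tversky_ratio({"x", "y", "z"}, {"x", "y"}, alpha=1.0, beta=0.0)
```

With unequal weights the result depends on the direction of the comparison, so the measure is not symmetric.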

Yet, the two domains are not separated from each other in an absolute manner. The logarithm is a rather simple analytic function with some unique properties. For instance, it is not defined for argument values in [-∞..0]. The zero (“0”), however, can in turn be taken to serve as a double articulation that allows us to express two very different things: (1) through linear normalization, the lowest value of the range, and (2) the (still symbolic) absence of a property. Using the logarithm, the value “0” gets transformed into a missing value, because the logarithm is not defined for arg=0; that is, we turn the symbolic into a quasi-physical absence. The *extremely* (!) valuable consequence of this is that by means of the logarithmic transformation we can change the feature vector on the fly in a context-dependent manner, where “context” (i) can denote any relation between variables or values therein, and (ii) may be related to certain segments of observations. Even the (extensional) items within an (intensional) class or empirical category may be described by dynamically regulated sets of assignates (features). In other words, the logarithmic transformation provides a plain way towards abstraction. Classes as clusters do not just comprise items homogenized by an identical feature set. Hence, it is a very powerful means in predictive modeling.
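The mechanism can be sketched in a few lines. Using `None` as the marker for a missing value and the helper name `active_assignates` are our assumptions for illustration.

```python
from math import log


def log_transform(profile):
    """Take the logarithm of each value; non-positive or missing values become None."""
    return [log(v) if v is not None and v > 0 else None for v in profile]


def active_assignates(profile):
    """Indices of the assignates that are still present after the transformation."""
    return [i for i, v in enumerate(profile) if v is not None]


normalized_profile = [1.0, 0.0, 0.5]          # 0.0 is the lowest value of its range
transformed = log_transform(normalized_profile)
remaining = active_assignates(transformed)    # the second assignate has dropped out
```

Which assignates remain thus depends on the values of the individual observation, so different observations can end up with different effective feature sets.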

Given the two domains in the practical aspects of similarity measures it is now becoming more clear that we indeed need to insist on a separation of assignates and the mapping similarity function, as we did in the chapter about comparison. We reproduce Figure 2b from that chapter:

Figure 2: Schematic representation of the comparison of two items. Items are compared along sets of “attributes,” which have to be assigned to the items, indicated by the symbols {a} and {b}.

The sets of assignates symbolized as {a} and {b} for items A and B do not comprise just the “observable” raw “properties.” Of course, all those properties are selected and assigned by the observer, which results in the fact that the observed items are literally established only through this measurement step. Additionally, the assigned attributes, or better “assignates,” also comprise all transformations of raw, primary assignates, building two extended sets of assignates. The similarity function then imposes a rule for calculating the scalar (a single value) that finally serves as a representation of the respective operationalization. This function may represent any kind of mapping between the extended sets of assignates.

Such a mapping (function) could consist of a compound of weighted partial functions, according to the proposals of Tversky or Cheng, and a particular profile mapping. Sets {a} and {b} need not be equal, of course. One could even apply the concept of formalized contexts instead of a set of equally weighted items. Nevertheless, there remains the apriori of the selection of the assignates, which precedes the calculation of the scalar. In practical modeling this selection will almost certainly lead to a removal of most of the measured “features.”

Above we said that any similarity measure must be considered as a free parameter in modeling, that is, as nothing else than a hypothesis. For the sake of abstraction and formalization this requires that we generalize the single similarity function into a family of functions, which we call “functional.” In category theoretic terms we could call it also a “functor.” The functor of all similarity functions then would be part of the functor representing the generalized model.

##### Formal Aspects

In the chapter about the category of models we argued that models can not be conceived in a set theoretic framework. Instead, we propose to describe models and the relations among them on the level of categories, namely the category of functors. In plain words, models are *transformative relations*, or in a term from category theory, arrows. Similarity is a dominant part of those transformative relations.

Against this background (or: within this framing), we could say that similarity is a property of the arrow, while a particular similarity function represents a particular transformation. By expressing similarity as a value we effectively map these properties of the arrow to a scalar, which could be a touchable value or which could be an abstract scalar. Even more condensed, we could say that in a general perspective:

*Similarity can be conceived as a mapping of relations onto a scalar.*

This scalar should not be misunderstood as the value of the categorical “arrow.” Arrows in category theory are not vectors in a coordinate system. The assessment of similarity thus can’t be taken just as kind of a simple arithmetic transformation. As we already said above from a different perspective, similarity is not a property of objects.

Since similarity makes sense only in the context of comparing, hence in the context of modeling, we also can recognize that the value of this scalar is dependent on the purpose and its operationalization, the target variable. Similarity is nothing which could be measured. It is entirely the result of an intention.

##### Similarity and the Symbolic

It is more appropriate to understand similarity as the actualization of a potential. Since the formal result of this actualization is a scalar, i.e. a primitive with only a simple structure, this actualization prepares also the ground for the possibility of a new symbolization. The similarity scalar is able to take three quite different roles. First, it can act as a criterion to impose a differential semi-order under ceteris paribus conditions for modeling. Actual modeling may be strongly dominated by arbitrary, but nevertheless stable habits. Second, the similarity scalar also could be taken as an ultimate “abbreviation” of a complex activity. Third, and finally, the scalar may well appear as a quasi-material entity due to the fact that there is so little inner structure to it.

*It is the “Similarity-Game” that serves as a ground for hatching symbols.*

Imagine playing this game according to the Euclidean rules. Nobody could expect rich or interesting results, of course. The same holds if “similarity” is misunderstood as technical issue, which could be represented or determined as a closed formalism.2

It is clear that these results are also quite important to understand the working of metaphors in practiced language. Actually, we think that there is no other mode of speaking in “natural,” i.e. practiced languages than the metaphorical mode. The understanding of similarity as a ground for hatching symbols directly leads to the conclusion that words and arrangements of words do not “represent” or “refer to” something. Even more concisely, we may say that neither things nor signs or symbols are able to carry references. Everything is created in the mind. Yet, and still refuting radical constructivism, we suggest that the tools for this creative work are all taken from the public.

#### Conclusions

As usual, we finally address the question about the relevance of our achieved results for the topic of machine-based epistemology.

From a technical perspective, the most salient insight is probably the relativity of similarity. This relativity renders similarity into a strictly non-ontological concept (we anyway think that the idea of “pure” ontology is based on a great misunderstanding). Despite the fact that it is pretended thousands of times each day that “the” similarity has been calculated, such “calculation” is not possible. The reason for this is simply that (1) it is not just a calculation as for instance, the calculation of the Easter date, and (2) there is nothing like “the” similarity that could be calculated.

In any implementation that provides means for the comparison of items we have to take care of an appropriate generality. Similarity should never be implemented as a formula, but instead as a (templated or abstract) object. Another (not only) technical aspect concerns the increasing importance of the “form-factor” when comparing profiles the more assignates are used to compare the items. This should be respected in any implementation of a similarity measure by increasing the weight of such “correlational” aspects.
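One way to read this recommendation is sketched below: the comparison routine receives the similarity measure as an exchangeable object instead of hard-wiring a formula. The class names, the two concrete measures and the example profiles are hypothetical and serve only to illustrate the design.

```python
from abc import ABC, abstractmethod
from math import sqrt


class Similarity(ABC):
    """A similarity measure as an exchangeable object, i.e. as a hypothesis."""

    @abstractmethod
    def __call__(self, a, b):
        """Return a scalar; larger means more similar."""


class EuclideanSimilarity(Similarity):
    def __call__(self, a, b):
        d = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        return 1.0 - d / sqrt(len(a))   # scaled so the result stays in [0, 1]


class CorrelationSimilarity(Similarity):
    def __call__(self, a, b):
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        return cov / sqrt(var_a * var_b)   # Pearson r; fails for constant profiles


def most_similar(query, candidates, similarity):
    """Name of the candidate profile rated most similar to the query."""
    return max(candidates, key=lambda name: similarity(query, candidates[name]))


query = [0.9, 0.3, 0.8, 0.2, 0.7, 0.4]
candidates = {
    "B": [0.8, 0.45, 0.7, 0.35, 0.8, 0.3],
    "C": [0.45, 0.15, 0.4, 0.1, 0.35, 0.2],   # the query damped by the factor 0.5
}
by_distance = most_similar(query, candidates, EuclideanSimilarity())
by_correlation = most_similar(query, candidates, CorrelationSimilarity())
```

Because `most_similar` knows nothing about the measure it applies, the measure can be varied and tested for adequacy like any other free parameter of the model.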

From a philosophical perspective there are several interesting issues to mention. It should be clear that our notion of similarity is not following the realist account. Our notion of similarity is not directed towards the relation of “objects” in the “physical world” and “concepts” in the “mental world.” Please excuse the inflationary usage of quotation marks, yet it is not possible otherwise to repel realism in such sentences. Indeed, we think that similarity can’t be applied to concepts at all. Trying to do so [e.g. 17], one would commit a double categorical mistake: First, concepts may arise exclusively as an *embedment* (not: entailment) of symbols, which in turn require similarity as an operational field. It is impossible to apply similarity to concepts without further interpretation. Second, concepts can’t be positively determined and they are best conceived as transcendental choreostemic poles. This categorically excludes the application of the concept of similarity to the concept of concepts. A naturalization by means of (artificial) neuronal structures [e.g. 18] is missing the point even more dramatically.3 “Concept” and “similarity” are mutually co-extensive, “similarly” to space and time.

As always, we think that there is the primacy of interpretation, hence it is useless to talk about a “physical world as it is as-such.” We do not deny that there is a physical outside, of course. Through extensive individual modeling that is not only shared by a large community, but also has to provide some anticipatory utility, we even may achieve insights, i.e. derive concepts, that one could call “similar” with regard to a world. But again, this would require a position outside of the world AND outside of the used concepts and practiced language. Such a position is not available. “Objects” do not “exist” prior to interpretation. “Objecthood” derives from (abstract) substance by adding a lot of particular, often “structural,” assignates within the process of modeling, most salient by imposing purpose and the respective instance of the concept of similarity.

There are two conclusions from that. First, similarity is a purely operational concept, it does not imply any kind of relation to an “existent” reference. Second, it would be wrong to limit similarity (and modeling, anticipation, comparison, etc.) to external entities like material “objects.” With the exception of pure logic, we always have to interpret. We interpret by using a word in thought, we interpret even if our thoughts are shapeless. Thinking is an open, processual system of cascaded modeling relations. Modeling starts with the material interaction between material aspects of “objects” or bodies, it takes place throughout the perception of external differences, the transduction and translation of internal signals, the establishment of intensions and concepts in our associative networks, up to the ideas of inference and propositional content.

Our investigations ended by describing similarity as a scalar. This dimensionless appearance should not be misunderstood in a representationalist manner, that is, as indication that similarity does not have a structure. Our analysis revealed important structural aspects that relate to many areas in philosophy.

In other chapters we have seen that modeling and comparing are inevitable actions. Due to their transcendental character we even may say that they are inevitable events. As subjects, we can’t evade the necessity of modeling. We can do it in a diagnostic attitude, directed backward in time, or we can do it in a predictive or anticipatory attitude, directed forward in time. Both directions are connected through learning and bridged by Peirce’s sign situation, but any kind of starting point reduces to modeling. If spelled out in a sudden manner it may sound strange that modeling and comparing are deeply inscribed into the event-structure of the world. Yet, there are good reasons to think so.

Against this background, similarity denotes a hot spot for the actualization of intentions. As an element in modeling it is the operation that transports purposes into the world and its perception. Even more concisely, we may call similarity the *carrier of purpose*. For all other elements of modeling besides the purpose and similarity one can refer to “necessities,” such as material constraints, or limitations regarding time and energy. (Obviously, contingent selections are nothing one can speak about in other ways than just by naming them; they are singularities.)

Having said this, it is clear that the relative neglect of similarity as compared to logic should be corrected. Similarity is the (abstract) hatching-ground for symbols, so to say, in Platonic terms, *the sky for the ideas*.

Notes

1. The geometrical approach is largely equal to what today is known as the feature vector approach, which is part of any dimensional mapping. Examples are multi-dimensional scaling, principal component analysis, or self-organizing maps.

2. Category theory provides a formalism that is not closed, since categories can be defined in terms of category theory. This self-referentiality is unique among formal approaches. Examples for closed formalisms are group theory, functional analysis or calculi like λ-calculus.

3. Besides that, Christoph Gauker provided further arguments that concepts could not be conceived as regions of similarity spaces [19].

- [1] McGill, M., Koll, M., and Noreault, T. (1979). An evaluation of factors affecting document ranking by information retrieval systems. Final report for grant NSF-IST-78-10454 to the National Science Foundation, Syracuse University.
- [2] Wesley C. Salmon.
- [3] van Rooij, Robert. 2011c. Vagueness and linguistics. In: G. Ronzitti (ed.), The vagueness handbook, Springer New York, 2011.
- [4] Lakoff
- [5] Haverkamp (ed.), Metaphorologie
- [6] Duden. Das Herkunftswörterbuch. Die Etymologie der Deutschen Sprache. Mannheim 1963.
- [7] Oxford Encyclopedia of Semiotics: Semiotic Terminology. available online, last accessed 29.12.2011.
- [8] Amos Tversky (1977), Features of Similarity. Psychological Review, Vol.84, No.4.
- [9] Goldstone. Comparison. Springer, New York 2010.
- [10] Dekang Lin, An Information-Theoretic Definition of Similarity. In: Proceedings of the 15th International Conference on Machine Learning ICML, 1998, pp. 296-304.
- [11] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996). From Data Mining to Knowledge Discovery in Databases. American Association for Artificial Intelligence, p.37-54.
- [12] Thomas Reinartz, Focusing Solutions for Data Mining: Analytical Studies and Experimental Results in Real-World Domains (LNCS) Springer, Berlin 1999.
- [13] G. Nakaeizadeh (ed.). Data mining. Physica Weinheim, 1998.
- [14] Alan Hájek (2007), The Reference Class Problem is Your Problem Too. Synthese 156: 185-215.
- [15] Edward K. Cheng (2009), A Practical Solution to the Reference Class Problem. Columbia Law Review Vol.109:2081-2105.
- [16] Peirce, Stanford Encyclopedia.
- [17] Tim Schroeder (2007), A Recipe for Concept Similarity. Mind & Language, Vol.22 No.1. pp. 68–91.
- [18] Churchland
- [19] Christoph Gauker (2007), A Critique of the Similarity Space Theory of Concepts. Mind & Language, Vol.22 No.4, pp.317–345.

۞

## Technical Aspects of Modeling

December 21, 2011

Modeling is not only inevitable in an empirical world, it is also a primary practice. Being in the world as an empirical being thus means continuous modeling. Modeling is not an event-like activity; it is much more like collecting energy in a photovoltaic device. This applies not only to living systems but should also apply to economic organizations.

Modeling thus comprises much more than selecting some data from a source and applying a method or algorithm to them. You may conceive that difference metaphorically as the difference between a machine and a plant for producing some goods. Here in this chapter we first will identify and briefly characterize the elements of continuous modeling; then we will show the overall arrangement of those elements as well as the structure of the modeling core process. We then will even step further down to the level of properties a modeling software for (continuous) modeling should comprise.

You will find many more details and a thorough discussion of the various design decisions for the respective software system in the attached document “The SPELA-Approach to Predictive Modeling.” This acronym stands for “Self-Configuring Profile-Based Evolutionary Learning Approach.” The document also describes how the result of modeling based on SPELA may be used properly for reasoning about the data.

#### Elements of Modeling

As we have shown in the chapter about the generalized model, a model needs a purpose. This targeted modeling and its internal organization is the subject of this chapter. Here we will not deal with the problem of modeling unstructured data such as texts or images. Understanding language tokens requires a considerable extension of the modeling framework, despite the fact that modeling as outlined here remains an important part of understanding language tokens. Those extensions mainly concern an appropriate probabilization of what we experience as words or sentences. We discuss this elsewhere, both in a more technical and in a fully contextualized treatment.

Goal-oriented modeling can be automated to a great extent, if an appropriate perspective to the concept of model is taken (see chapters about the generalized model, and model as categories).

Such automated modeling also can be run as a continuous process. Its main elements are the following:

- (1) post-measurement phase: selecting and importing data;
- (2) extended classification by a core process group: building an intensional representation;
- (3) reflective post-processing (validation), meta-modeling, based on a (self-)monitoring repository;
- (4) harvesting results and/or aligning measurement.

#### Overall Organization

The elements of (continuous, automated) modeling need to be arranged according to the following multi-layered, multi-loop organizational scheme:

**Figure 1:** Organizational elements for automated, continuous modeling; **L<n>** = loop levels; **T** = transformation of data; **S** = segmentation of data; **R** = detecting pairwise relationships between variables and identifying them as mathematical functions F(d)=f(x,y), F(d) being a non-linear function improving the discriminant power represented by the typology derived in S; **PP** = post-processing, e.g. creating dependency path diagrams; **DB** = some kind of data source, e.g. a database. These elements are connected through 4 levels of loops, L1 through L4, where **L1** = finding a semi-ordered list of optimized models, **L2** = introducing the relationships found by R into data transformation, **L3** = additional sampling of raw data based on post-processing of the core process group (active sampling), e.g. for cross-validation purposes, and finally **L4** = adapting the objective of the modeling process based on the results presented to the *user*. Feedback level L4 may be automated through selecting from pre-configured modeling policies, where a policy is a set of rules and sets of parameters controlling the feedback levels L1 through L3 as well as the core modules.

This scheme may be different from anything you have seen so far about modeling. Common software packages, whether commercial (SPSS, SAS, S-Plus, etc.) or open source (R, Weka, Orange), do not natively support this scheme. Some of them would allow for a similar scheme, but it is hard to accomplish. For instance, the transformation part is not properly separated and embedded in the overall process, and there is no possibility to screen for pairwise relationships that are then automatically actualized as further transformations of the “data.” There is no meta-data and no abstraction inherent to the process. As a consequence, literally everything is left to the user, rendering those packages into gigantic formalisms. This comes, on the other hand, as little surprise, given the current paradigm of deterministic computing.

The main reason, however, for the incapability of any of these packages is the inappropriate theory behind them. Neither the paradigm of statistics nor that of “data mining” is applicable at all to the task of automated and continuous modeling.

Anyway, next we will describe the loops appearing in the scheme. The elements of the core process we will describe in detail later.

Here we should mention another process-oriented approach for predictive modeling, the CRISP-DM scheme, which dates back to 1997 as a result of an initiative launched by a consortium including NCR. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. However, CRISP-DM is of a hopelessly solipsistic character and only of little value.

Before we start we should note that the scheme above reflects an ideal and rather simple situation. More often than not, a nested, if not fractal structure appears, especially regarding loop levels L1 and L2.

##### Loop Level 1: Resonance

Here we find the associative structure, e.g. a self-organizing map. An important requirement is that this mechanism is working bottom-up, and a consequence of this is that it is an approximate mechanism.

The purpose of this element is to perform a segmentation of the available data, given a particular data space as defined by the “features,” or more appropriately, the assignates (see the chapter about the generalized model for this).

It is important to understand that it is impossible for the segmentation mechanism to change the structure of the available data space. Loop level L1 also provides what is called the transition from extensional to intensional description.

L1 also performs feature selection. Given a set of features F_O, representing descriptional “dimensions” or aspects of the observations O, many of those features are not related to the intended target of the model. Hence, they introduce noise and have to be removed, which results, in other words, in a selection of the remaining ones.

In many applications, there are large numbers of variables, especially if L2 is repeated, resulting in a vast number of possible selections. The number of possible combinations from the set of assignates easily exceeds 10^20, and sometimes even 10^100. This is a larger quantity than the number of subatomic particles in the visible universe. The only way to find a reasonable proposal for a “good” selection is by means of an evolutionary mechanism. Formal, probabilistic approaches will fail.
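To make the orders of magnitude tangible, here is a small worked calculation. It assumes, as a simplification that is not spelled out in the text, that a "selection" is any non-empty subset of the available assignates, so that n assignates yield 2^n − 1 possible selections. The function names are illustrative only.

```python
def n_selections(n_features: int) -> int:
    """Number of non-empty subsets of n_features assignates (assumption: subset = selection)."""
    return 2 ** n_features - 1


def min_features_exceeding(limit: int) -> int:
    """Smallest number of assignates whose count of selections exceeds `limit`."""
    n = 1
    while n_selections(n) <= limit:
        n += 1
    return n


for exponent in (20, 100):
    n = min_features_exceeding(10 ** exponent)
    print(f"more than 10^{exponent} selections are possible with {n} assignates")
```

Under this assumption a few dozen variables are already enough to pass the first threshold, and a few hundred pass the second, which is why exhaustive enumeration is not an option.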

The result of this step is a segmentation that can be represented as a table. The rows represent profiles of prototypes, while the columns show the selected features (for further details see below in the section about the core process).

##### Loop Level 2: Hypothetico-deductive Transformation

This step starts with a fixed segmentation based on a particular selection ℱ out of F_O. The prototypes identified by L1 are the input data for a screening that employs analytic transformations of values within (mostly) pairwise selected variables, such as f(a,b) = a*b or f(a,b) = 1/(a+b). Given a fixed set of analytic functions, a complete screening is performed for all possible combinations. Typically, several millions of individual checks are performed.

It is very important to understand that not the original data are used as input, but instead the data on the level of the intensional description, i.e. a first-order abstraction of the data.

Once the most promising transformations have been identified, they are introduced automatically into the set of original transformations in the element T of figure 1.
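The screening of loop level 2 could be sketched roughly as follows. This is a toy illustration under stated assumptions, not the procedure of any particular software: the prototype table, the variable names, and above all the scoring rule (absolute Pearson correlation between the transformed value and a per-prototype target rate) are hypothetical stand-ins for whatever criterion the goal-oriented segmentation provides.

```python
from itertools import combinations
from math import sqrt

# Hypothetical prototype profiles as produced by L1: one row per prototype,
# "target" being the share of target cases collected by that prototype.
PROTOTYPES = [
    {"a": 1.0, "b": 4.0, "c": 5.0, "target": 0.4},
    {"a": 2.0, "b": 1.0, "c": 1.0, "target": 0.2},
    {"a": 3.0, "b": 3.0, "c": 2.0, "target": 0.9},
    {"a": 4.0, "b": 2.0, "c": 4.0, "target": 0.8},
    {"a": 2.0, "b": 4.0, "c": 3.0, "target": 0.8},
]

# The fixed set of analytic functions; the two examples from the text.
FUNCTIONS = {
    "a*b": lambda x, y: x * y,
    "1/(a+b)": lambda x, y: 1.0 / (x + y),
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def screen(prototypes, variables, functions, target="target"):
    """Score every function on every pair of variables, best candidates first."""
    ys = [p[target] for p in prototypes]
    results = []
    for u, v in combinations(variables, 2):
        for name, f in functions.items():
            xs = [f(p[u], p[v]) for p in prototypes]
            results.append((abs(pearson(xs, ys)), name, u, v))
    return sorted(results, reverse=True)


ranking = screen(PROTOTYPES, ["a", "b", "c"], FUNCTIONS)
print(ranking[0])  # the most promising transformation and the pair it applies to
```

Note that the input is the table of prototypes, not the raw observations. The top-ranked candidates would then be handed back to the transformation element T as new derived variables.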

##### Loop Level 3: Adaptive Sampling

See the legend of Figure 1.

##### Loop Level 4: Re-Orientation

While the use aspects are of course already reflected by the target variable and the selected risk structure, there is a further important aspect concerning the “usage” of models. Up to level 3 the whole modeling process can be run in an autonomous manner. Yet, not so on level 4.

Level 4 and its associated loop have been included in the modeling scheme as a dedicated means for re-orientation. The results of an L3 modeling run could lead to “insights” that change the preference structure of the user. Upon this change in her/his preferences, the user could choose a different risk structure, or even a different target, perhaps also to create a further model with a complementary target.

These choices are obviously dependent on external influences such as organizational issues, or limitations / opportunities regarding the available resources.

#### Structure of the Modeling Core Process

| 1. Transformation of Data | 2. Goal-oriented Segmentation | 3. Artificial Evolution | 4. Dependencies |
|---|---|---|---|
| P = putative property (“assignate”); F = arbitrary function; var = “raw” variable(s) | profiles, prototypes, concepts | combinatorial exploration of associations between variables | complete calculation of relations as analytic functions |

**Figure 2:** Organizational elements of the modeling core process. The bottom row shows important keywords.

##### Transformation of Data

This step performs a purely formal, arithmetic and hence analytic transformation of values in a data table. Examples are:

- the log-transformation of a single variable, shifting the mode of the distribution to the right, thus allowing for a better discrimination of small values; one can also use it to create missing values in order to adaptively filter certain values, and thus observations;
- combinatorial synthesis of new variables from 2+ variables, which results in a stretching, warping or folding of the parameter space;
- separating values from one variable into two new and mutually exclusive variables;
- binning, that is, reducing the scale of the variable, say from numeric to ordinal;
- any statistical measure or procedure, changing the quality of an observation: resulting values do not reflect observations, but instead represent a weight relative to the statistical measure.
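The first four kinds of transformation listed above can be illustrated with a few lines of plain Python. This is a minimal sketch: the function names, the use of `None` as the marker for a missing value, and the threshold parameters are assumptions made for the example.

```python
import math
from bisect import bisect_right


def log_transform(values, min_value=None):
    """Log-transform a column; non-positive values, or values below the optional
    min_value, become missing (None), which filters them adaptively."""
    out = []
    for v in values:
        if v is None or v <= 0 or (min_value is not None and v < min_value):
            out.append(None)
        else:
            out.append(math.log(v))
    return out


def combine(xs, ys, f):
    """Synthesize a new variable from two columns; missing inputs stay missing."""
    return [None if x is None or y is None else f(x, y) for x, y in zip(xs, ys)]


def split(values, threshold):
    """Separate one variable into two mutually exclusive variables."""
    low = [v if v is not None and v < threshold else None for v in values]
    high = [v if v is not None and v >= threshold else None for v in values]
    return low, high


def bin_ordinal(values, edges):
    """Reduce a numeric scale to ordinal bin indices, given ascending bin edges."""
    return [None if v is None else bisect_right(edges, v) for v in values]


income = [1.0, 0.0, 250.0, 40.0]
age = [30.0, 45.0, 52.0, 19.0]
print(log_transform(income))                       # zero income becomes missing
print(combine(income, age, lambda x, y: x * y))    # a new, combined variable
print(split(age, 40.0))                            # two exclusive variables
print(bin_ordinal(age, [25.0, 50.0]))              # ordinal classes 0, 1, 2
```

Each call adds a column to the table rather than replacing one, which is why the number of variables grows with every transformation step.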

A salient effect of the transformation of data is the increase of the number of variables. Also note that any of those analytic transformations destroys a little bit of the total information, although it also leads to a better discriminability of certain sub-spaces of the parameter space. Most important, however, is to understand that any analytic transformation is conceived as a hypothesis. Whether it is appropriate or not can be revealed ONLY by means of a targeted (goal-oriented) segmentation, which implies a cost-function that in turn comprises the operationalization of risk (see the chapter about the generalized model).

Any of the resulting variables consists of assignates, i.e. the assigned attributes or features. Due to the transformation they comprise not just the “raw” or primary properties upon the first contact of the observer with the observed, but also all of the transformations applied to such raw properties (aka variables). This results in an extended set of assignates.

We now can also see that transformations of measured data are taking the same role as measurement devices. Initial differences in signals are received and selectively filtered according to the quasi-material properties of the device. The first step in figure 2 above thus also represents what could be called generalized measurement.

Transforming data by whatever algorithm or analytic method does NOT create a model. In other words, the model-aspect of statistical models is not in the statistical procedure, precisely because statistical models are not built upon associative mechanisms. The same is true for the widespread “physicalist” modeling, e.g. in social sciences or urbanism. In these areas, measured data are often represented by a “formula,” i.e. a purely formal denotation, often in the form of a system of differential equations. Such systems are not by themselves models, because they are analytic rewritings of the data. The model-aspect of such formulas gets instantiated *only* through associating parts of the measured data with a target variable as an operationalization of the purpose. Without target variable, no purpose, without purpose no model, without model, no association, hence no prediction, no diagnostics, and not any kind of notion of risk. Formal approaches always need further side-conditions and premises before they can be applied. Yet, it is silly to come up with conditions for instantiations of “models” after the model has been built, since those conditions inevitably would lead to a different model. The modeling-aspect, again, is completely moved to the person(s) applying the model, hence such modeling is deeply subjective, implying serious and above all completely invisible risks regarding reproducibility and stability.

We conclude that the pretended modeling by formal methods has to be rated as bad practice.

##### Goal-oriented Segmentation

The segmentation of the data can be represented as a table. The rows represent profiles of prototypes, while the columns show the selected assignates (features); for further details see below in the section about the core process (*will be added at a future date!*).

In order to allow for a comparison of the profiles, the profiles have to be “homogeneous” with respect to their normalized variance. The standard SOM tends to collect “waste” or noise in some clusters, i.e. deeply dissimilar observations are collected in a single group because of their dissimilarity. Here we find one of the important modifications of the standard SOM as it is widely used. The effect of this modification is substantial. For other design issues around the Self-organizing Map see the separate discussion of the SOM.
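The table of profiles and the homogeneity requirement can be made concrete with a short sketch. It does not reproduce the modified SOM mentioned above, whose details are not given here. It only shows, for an already existing segmentation, how a profile table might be built and how a "waste" cluster could be flagged. The normalization (within-segment variance divided by the overall variance, averaged over features) and the threshold of 0.5 are assumptions chosen for the example.

```python
from statistics import mean, pvariance


def profile(segment):
    """Prototype profile of a segment: the column-wise mean of its observations."""
    return [mean(col) for col in zip(*segment)]


def normalized_variance(segment, global_var):
    """Within-segment variance per feature relative to the overall variance, averaged."""
    return mean(pvariance(col) / g for col, g in zip(zip(*segment), global_var))


def profile_table(segments, max_norm_var=0.5):
    """One row per prototype: its profile plus a homogeneity flag."""
    all_rows = [row for seg in segments for row in seg]
    global_var = [pvariance(col) for col in zip(*all_rows)]
    table = []
    for seg in segments:
        nv = normalized_variance(seg, global_var)
        table.append({"profile": profile(seg), "norm_var": nv,
                      "homogeneous": nv <= max_norm_var})
    return table


# Hypothetical segmentation of six observations with two features each.
segments = [
    [[1.0, 1.0], [1.2, 0.8]],   # tight group
    [[5.0, 5.0], [5.2, 4.8]],   # tight group
    [[0.0, 9.0], [9.0, 0.0]],   # "waste": grouped only because they fit nowhere else
]
for row in profile_table(segments):
    print(row)
```

The third segment has a perfectly ordinary-looking mean profile, which is exactly why its profile must not be compared with the others without checking the variance first.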

##### Artificial Evolution

Necessary and even inevitable for screening the vast parameter space.
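As a rough illustration of what such an evolutionary screening of feature selections can look like, consider the following toy sketch. It is a generic, minimal genetic algorithm and not the mechanism of any specific software. In particular, `toy_fitness` is a hypothetical stand-in: in the scheme described here, the quality of a selection would be obtained from the goal-oriented segmentation and its cost function.

```python
import random


def evolve(n_features, fitness, pop_size=20, generations=40, seed=0):
    """Search for a good feature mask (list of booleans) by artificial evolution."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]        # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)
            child = mother[:cut] + father[cut:]   # one-point crossover
            i = rng.randrange(n_features)
            child[i] = not child[i]               # single-bit mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness), history


RELEVANT = {0, 3}  # hypothetical: only features 0 and 3 relate to the target


def toy_fitness(mask):
    """Reward relevant features, penalize the inclusion of noise features."""
    hits = sum(1 for i in RELEVANT if mask[i])
    noise = sum(1 for i, on in enumerate(mask) if on and i not in RELEVANT)
    return hits - 0.5 * noise


best, history = evolve(8, toy_fitness)
print(best, toy_fitness(best))
```

Because the better half of the population is carried over unchanged, the best fitness found never decreases from one generation to the next. The search only ever evaluates a tiny fraction of all possible selections.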

##### Dependencies

See Loop Level 2 above.

#### Bad Habits

In the practice of modeling one can find bad habits regarding any of the elements, loops and steps outlined above. Beginning with the preparation of data there is the myth that missing values need to be “guessed” before an analysis could be done. What would be the justification for the selection of a particular method to “infer” a value that is missing due to incomplete measurement? What do people expect to find in such data? Of course, filling gaps in data before creating a model from it is deep nonsense.

Another myth, still in the early phases of the modeling process, is represented by the belief that analytical methods applied to measurement data “create” a model. They don’t. They just destroy information. As soon as we align the working of the modeling mechanism to some target variable, the whole endeavor is not analytic any more. Yet, without a target variable we would not create a model, just re-written measurement values that don’t even measure “anything”: measurement also needs a purpose. So it would be just silly first to pretend to do measurement and after that to drop that intention by removing the target variable. All of statistics works like that. Whatever statistics is doing, it is not modeling. If someone uses statistics, that person uses just a rewriting tool; the modeling itself remains deeply opaque, based on personal preferences, in short: unscientific.

People recognize more and more that clustering is indispensable for modeling. Yet, many people, particularly in the biological sciences (all the -omics), believe that there is a meaningful distinction between unsupervised and supervised clustering, and that both varieties produce models. That’s deeply wrong. One can not apply, say, K-means clustering or a SOM without a target variable, that is, a cost function, just for checking whether there is “something in the data.” Any clustering algorithm applies some criteria to separate the observations. Why then should someone believe that precisely the more or less opaque, but surely purely formal, criteria of an arbitrary clustering algorithm should perfectly match the data at hand? Of course, nobody should believe that. Instead of surrendering oneself blindly to some arbitrary algorithmic properties one should think of those criteria as free parameters that have to be tested according to the purpose of the modeling activity.

Another widespread misbehavior concerns what is called “feature selection.” It is an abundant practice first to apply logistic regression to reduce the number of properties, then, in a completely separated second step, to apply any kind of “pattern matching” approach. Of course, the logistic regression acts as a kind of filter. But: is this filter compatible with the second method, is it appropriate to the data and the purpose at hand? You will never find out, because you have applied two different methods. It is thus impossible to play the *ceteris paribus* game. It appears comprehensible to proceed according to the split-method approach if you have just paper and pencil at your disposal. It is inexcusable to do so if there are computers available.

In contrast to the split-method approach, one should use a single method that is able to perform feature selection AND data segmentation in the same process.

There are further problematic attitudes concerning the validation of models, especially concerning sampling and risk, which we won’t discuss here.

#### Conclusion

In this essay we are providing the first general and complete scheme for target oriented modeling. The main structural achievements comprise (1) the separation of analytic transformation, (2) associative sorting, (3) evolutionary optimization of the selection of assignates and (4) the constructive and combinatorial derivation of new assignates.

Note that *any* (computational) procedure of modeling fits into this scheme, even this scheme itself. Ultimately, any modeling results in a supervised mapping. In the chapters about the abstract formalization of models as categories we argue that models are level-2-categories.

It is precisely this separation that allows for an autonomous execution of modeling once the user has determined her target and the risk that appears acceptable. It depends completely on the context (whether external, organizational or internal and more psychological) and on individual habits how these dimensions of purpose and safety are being configured and handled.

From the perspective of our general interest in machine-based epistemology we clearly can see that target oriented modeling by itself does **not** contribute too much to that capability. Modeling, even if creating new hypotheses, and even if we can reject the claim that modeling is an analytic activity, necessarily remains within the borders of the space determined by the purpose and the observations.

There is no particular difficulty in running even advanced modeling in an autonomous manner. Performing modeling is an almost material performance. Defining the target and selecting a risk attitude are of course not. Thus, in any predictive or diagnostic modeling the crucial point is to determine those. Particularly the risk attitude implies unrooted beliefs and thus the embedding into social processes. Frequently, humans even change the target in order to obey certain limits concerning risk. Thus, in commercial projects the risk should be the only dimension one has to talk about when it comes to predictive / diagnostic modeling. Discussing methods or tools is nothing but silly.

It is pretty clear that approaching the capability for theory-building needs more than modeling, although target oriented modeling is a necessary ingredient. We will see in further chapters how we can achieve that. The important step will be to drop the target from modeling. The result will be a pre-specific modeling, or associative storage, which serves as a substrate for any modeling that is serving a particular target.

*This article was first published 21/12/2011, last revision is from 5/2/2012*

۞

## Representation

October 24, 2011

Representation always has been some kind of magic.

Something could have been there—including all its associated power—without being physically there. Magic, indeed, and involving much more than that.

Literally, if we take the early Latin roots as a measure, it means to present something again, to place something again or in an emphasized style before something else or somebody, usually by means of a placeholder, the so-called representative. Not surprisingly, then, it is closely related to simulacrum, which stands for “likeness, image, form, representation, portrait.”

Bringing the notion of the simulacrum onto the table is dangerous, since it refers not only to one of the oldest philosophical debates, but also to a central one: What do we see by looking at the world? How can it be that we trust the images produced by our senses, imaginations, apprehensions? Consider only Plato’s famous answer, which we will not even cite here due to its distracting characteristics, and you can feel the philosophical vortices, if not twisters, caused by the philosophical image theory.

It is impossible to deal here with the issues raised by the concepts of representation and simulacrum in any more general sense, we have to focus on our main subject, the possibility and its conditions for machine-based epistemology.

The idea behind machine-based epistemology is to provide a framework for talking about the power of (abstract and concrete) machines to know and to know about the conditions of that (see the respective chapter for more details). Though by “machine” we do not understand a living being here, at least not a priori, it is something produced. Let us call the producer in a simplified manner a “programmer.” In stark contrast to that, the morphological principles of living organisms are the result of a really long and contingent history of an unimaginable 3.6 billion years. Many properties, as well as their generalizations, are *historical* necessities, and all properties of all living beings constitute a miraculous co-evolutionary fabric of dynamic relations. In the case of the machine, there are only few historical necessities, for good and for bad. The programmer has to define necessities, the modality of senses, the chain of classifications, the kind of materiality, etc. Among all these decisions there is one class that is predominantly important:

*How to represent external entities?*

Quite naturally, as “engineers” of cognitive machines we can not really evade the old debate about what is in our brains and minds, and what’s going on there while we are thinking, or even just recognizing a triangle as a triangle. Our programmer could take a practical stance to this question and reformulate it as: How could she or he achieve that the program will recognize *any* triangle?

It needs to be able to distinguish it from any other figure, even if the program has never been confronted with an “ideal” template or prototype. It also needs to identify quite incorrect triangles, e.g. from hand drawings, as triangles. It should even be able to identify virtual figures, which exist only in their negativity, like the Kanizsa triangle. For years, computer scientists proposed logical propositions and shape grammars as a solution, and failed completely. Today, machine learning in all its facets is popular, of course. This choice alone, however, is not yet the solution.

The new questions then have been (and still are): What to present to the learning procedure? How to organize the learning procedures?

Here we have to take care of a threatening misunderstanding, actually of two misunderstandings, heading from opposite directions towards the concept of “data.” Data are of course not “just there.” One needs a measurement device, which in turn is based on a theory, then on a particular way to derive models and devices from that theory. In other words, data are dependent on the culture. So far, we agree with Putnam about that. Nevertheless, given the body of a cognitive entity, that entity, whether human, animal or machine, finds itself “gestellt” into a particular actuality of measurement in any single situation. The theory about the data is a priori, yet within the particular situation the entity finds “raw data.” Both theory and data impose severe constraints on what can be perceived by or even known to the cognitive entity. Given the data, the cognitive entity will try to construct diagnostic / predictive models, including schemes of interpretations, theories, etc. The important question then concerns the relationship between a priori conditions regarding the cognitive entity and the possibly derived knowledge.

On the other hand, we can defend ourselves against the second misunderstanding. Data may be conceived as (situational) “givens”, as the Latin root of the word suggests. Yet, this givenness is not absolute. Somewhat more appropriately, we may conceive data as intermediate results of transformations. This renders any given method into some kind of abstract measurement device. The label of “data” we usually just use for those bits whose conditions of generation we can not influence.

Consider for instance a text. For the computer a text is just a non-random series of graphemes. We as humans can identify a grammar in human languages. For many years, if not decades, people thought that computers would understand language as soon as grammar had been implemented. The research by Chomsky [1], Jackendoff [2] and Pinker [3], among others, is widely recognized today, resulting in the concepts of phrase structure grammar, x-bar syntax or head-driven syntax. Yet, large research projects with hundreds of researchers (e.g. “verbmobil”) not only did not reach their self-chosen goals, they failed completely on the path to implementing an understanding of language. Even today, for most languages there is no useful parser available; the best parser for the German language achieves around 85-89% accuracy, which is disastrous for real applications.

Another approach is to bring in probabilistic theories. Particularly n-grams and Markov models have been favored. While the first one is an incredibly stupid idea for the representation of a text, Markov models are more successful. It can be shown that they are closely related to Bayesian belief networks and thus also to artificial neural networks, though the latter employ completely different mechanisms as compared to Markov models. Yet, from the very mechanism and the representation that is created as/by the Markov model, it is more than obvious that there is no such thing as language understanding.
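To see what kind of representation such a model actually builds, here is a minimal sketch of a first-order Markov (bigram) model of a token sequence. The function name and the toy sentence are illustrative; real systems add smoothing, longer histories or hidden states, but the principle stays the same.

```python
from collections import Counter, defaultdict


def bigram_model(tokens):
    """Estimate P(next word | current word) from raw co-occurrence counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(ctr.values()) for word, c in ctr.items()}
        for prev, ctr in counts.items()
    }


tokens = "the cat sat on the mat".split()
model = bigram_model(tokens)
print(model["the"])   # transition probabilities after "the"
print(model["cat"])   # transition probabilities after "cat"
```

The entire "knowledge" of the model is a table of transition frequencies between surface tokens. Nothing in that table refers to anything beyond the sequence itself.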

Quite obviously, language as text can not be represented as a grammar plus a dictionary of words. Doing so one would be struck by the “representational fallacy,” which not only has been criticized by Dreyfus recently [4], it is a matter of fact that representationalist in machine learning approaches failed completely. Representational cognitivism claims that we have distinct image-like engrams in our brain when we are experiencing what we call thinking. They should have read Wittgenstein first (e.g. About Certainty), before starting expensive research programs. That experience about one’s own basic mental affairs is as little directly accessible as any other thing we think or talk of. A major summary of many subjections against the representationalist stance in theories about the mind, as well as a substantial contribution is Rosenfield’s “The Invention of Memory” [6]. Rosenfield argues strongly against the concept of “memory as storage,” in the same venue as Edelman, to which we fully agree.

It does not help much either to resort to “simple” mathematical or statistical models, i.e. models effectively based on an analytical function, as opposed to models based on a complex system. Conceiving language as a mere “random process” of whatever kind simply does not work, be it those silly n-grams or sophisticated Hidden Markov Models. There are open source packages on the web that you can use to try it yourself.
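To make the point about purely local “random process” models tangible, here is a minimal sketch of a first-order (bigram) Markov chain over words. It is not taken from any of the packages alluded to above; the function names `train_bigram_model` and `generate` and the toy sentence are assumptions made for illustration only.

```python
import random
from collections import defaultdict, Counter

def train_bigram_model(tokens):
    """Count, for each word, how often each successor follows it."""
    transitions = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        transitions[current][following] += 1
    return transitions

def generate(transitions, start, length, seed=0):
    """Walk the chain: each step depends only on the previous word."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        successors = transitions.get(word)
        if not successors:  # dead end: this word never had a successor
            break
        choices, weights = zip(*successors.items())
        word = rng.choices(choices, weights=weights, k=1)[0]
        output.append(word)
    return output

# Hypothetical toy corpus, chosen only to keep the example small.
text = "the dog chased the cat and the cat chased the mouse"
model = train_bigram_model(text.split())
print(" ".join(generate(model, "the", 8)))
```

Everything the model “knows” is the table of successor counts. The generated sequence is locally plausible, but nothing in the representation refers to anything beyond the immediately preceding word.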

But what then “is” a text, and how does a text unfold its effects? Which aspects should be presented to the learning procedure, the “pattern detection engine,” such that the regularities could be appropriately extracted and a re-presentation could be built? Taking semiotics into account, we may add links between words. Yet, this involves semantics. Peter Janich has argued convincingly that the separation of syntax and semantics should be conceived of as just another positivist/cyberneticist myth [5]. And on which “level” should links be regarded as significant signals? If there are such links, any text immediately turns into a high-dimensional, non-trivial and above all dynamic network…

An interesting idea has been proposed by the research group around Teuvo Kohonen. They invented a procedure they call the WebSom [7]. You can find material about it on the web; we will also discuss it in great detail in our sections devoted to the SOM. There are two key elements of this approach:

- (1) It is a procedure which inherently abstracts from the text.
- (2) The text is not conceived, and (re-)presented, as “words”, i.e. distinct lexicographical primitives; instead words are mapped into the learning procedure as a weighted probabilistic function of their neighborhood.

Particularly seminal is the second of the key properties, the *probabilization* into overlapping neighborhoods. While we usually think that words are crisp entities arranged into a structured series, where the structure follows a grammar or is identical with it, this is not necessarily appropriate, not even for our own brain. The “atom” of human language is most likely not the word. Until today, most, if not all, people engaged in computational linguistics think that the word, or some very close abstraction of it, plus some accidentia, forms the basic entity, the indivisible element of language.
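The idea of presenting a word as a weighted probabilistic function of its neighborhood can be sketched in a few lines. This is not the WebSom algorithm itself, only an illustration under assumptions of our own: a symmetric window of two words, a weight of 1/d for a neighbor at distance d, and a simple shared-mass measure for the overlap. The names `neighborhood_distributions` and `overlap` are hypothetical.

```python
from collections import defaultdict

def neighborhood_distributions(tokens, window=2):
    """Map each word to a probability distribution over its neighbors.

    A neighbor at distance d inside the window contributes weight 1/d,
    an arbitrary decay chosen for this sketch.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                weights[word][tokens[j]] += 1.0 / abs(i - j)
    # Normalize each word's accumulated weights so that they sum to 1.
    distributions = {}
    for word, neighbors in weights.items():
        total = sum(neighbors.values())
        distributions[word] = {n: w / total for n, w in neighbors.items()}
    return distributions

def overlap(p, q):
    """Shared probability mass of two distributions (0 = disjoint, 1 = identical)."""
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical toy corpus, chosen only to keep the example small.
tokens = "the dog chased the cat and the cat chased the mouse".split()
dists = neighborhood_distributions(tokens)
print(round(overlap(dists["dog"], dists["cat"]), 3))
print(round(overlap(dists["dog"], dists["and"]), 3))
```

In this presentation a word is no longer a crisp token but a distribution, and two words are related to the degree that their neighborhoods overlap. Window size and decay are free choices here, which already hints at how strongly the presentation depends on the preparation.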

We propose that this attitude is utterly infected by some sort of pre-socratic and romantic cosmology, geometry and cybernetics. We cannot even know which representation is the “best”, or even an appropriate one. Even worse, the appropriateness of the presentation of raw data to the learning procedure, via various pre-processors and preparations of the raw data (series of words), is not independent of the learning procedure. We see that the problems with presentation and representation reach far into the field of modeling.

Although we cannot know in principle how to perform measurements in the most appropriate manner, as a matter of fact we will perform some form of measurement. Yet, this initial “raw data” does not “represent” anything, not even the entity that is the subject of the measurement. Only a predictive model derived from those observations can represent an entity, and it does so only in a given context largely determined by some purpose.

Whatever such an initial and multiple presentation of an entity will look like, it is crucial, in my opinion, to use a probabilized preparation of the basic input data. Yet the components of such preparations comprise not only the raw input data but also the experience of the whole engine, i.e. a kind of semantic influence acquired by learning. Further (potential) components of a particular small section of a text, say a few words, are any kind of property of the embedding text, of any extent: not only words as lexemes, but also words as learned entities and as structural elements, then sentences and their structural (syntactical) properties, semantic or speech-pragmatic markers, etc., and of course also a list of properties as Putnam proposed already in 1979 in “The Meaning of ‘Meaning’” [8].

Taken together, we can state that the inputs to the association engine are probability distributions over arbitrarily chosen “basic” properties. As we will see in the chapter on modeling, these properties are not to be confused with objective facts to be found in the external world. There we will also see how these insights can be operationalized in an implementation. In order to enable a machine to learn how to use words as items of a language, we should not present words to it in their propositional form. Any entity has to be measured as an entity from a random distribution and represented as a multi-dimensional probability distribution. In other words, we deny the possibility of transmitting any particular representation into the machine (or into another mind, for that matter). A particular manifold of representations has to be built up by the cognitive entity itself in direct response to requirements of the environment, which is just to be conceived as the embedding for “situations.” In the modeling chapter we will provide arguments for the view that this linkage to requirements does not result in behavioristic associativism, the simple linkage between stimulus and response according to the framework proposed by Watson and Pavlov. Target-oriented modeling in the multi-dimensional case necessarily leads to a manifold of representations. Not only the input is appropriately described by probability distributions, but also the output of learning.

And where is the representation of the learned subject? What does it look like? This question is almost meaningless, since it would require separating input, output, processing, etc., and it would deny the inherent manifoldness of modeling. In short, it is a deeply reductionist question. The learning entity is able to behave, react, anticipate, and measure; hence the whole entity itself is the representation.

The second important anatomical property of an entity able to acquire the capability to understand texts is the inherent abstraction. Above all, we should definitely not follow the flat-world approach of the positivist ideology. Note that the programmer should not only refrain from building a dictionary into the machine; he should also not predetermine the kind of abstraction the engine develops. This necessarily involves internal differentiation, which is another word for growth.

- [1] Noam Chomsky (to be completed…)
- [2] Jackendoff
- [3] Steven Pinker 1994?
- [4] Hubert L Dreyfus, How Representational Cognitivism Failed and is being replaced by Body/World Coupling. p.39-74, in: Karl Leidlmair (ed.), After Cognitivism: A Reassessment of Cognitive Science and Philosophy, Springer, 2009.
- [5] Peter Janich. 2005.
- [6] Israel Rosenfield, The Invention of Memory: A New View of the Brain. New York, 1988.
- [7] WebSom
- [8] Hilary Putnam, The Meaning of “Meaning”. 1979.

۞