## Similarity

December 30, 2011

Similarity appears to be a notoriously inflationary concept.

As early as 1979, a catalog of similarity measures in information retrieval, presumably still incomplete, listed almost 70 ways to determine similarity [1]. In contemporary philosophy, however, it is almost absent as a concept, probably because it is considered merely a minor technical aspect of empirical activities. Often it is also associated with naive realism, which claimed a similarity between physical reality and concepts. Similarity is also a central topic in cognitive psychology, yet it is not often discussed, probably for the same reasons as in philosophy.

In both disciplines, understanding is usually equated with drawing conclusions. Since the business of drawing conclusions, and of describing its kinds and settings, is considered to be the subject of logic (as a discipline), it is comprehensible that many practitioners and theoreticians alike have rated logic as the master discipline. While there has been a vivid discourse about logical aspects for many centuries, the role of similarity is largely neglected, and where vagueness makes its way to the surface, it is “analyzed” completely within logic. Not surprisingly, artificial intelligence focused strongly on a direct link to propositional logic and predicate calculus for a long period of time. This link has been represented by the programming language “Prolog,” an abbreviation standing for “programming in logic.” It was developed in the first half of the 1970s by Alain Colmerauer’s group in Marseille, in collaboration with Robert Kowalski of the so-called Edinburgh school. Let us just note that this branch of machine learning failed disastrously. Quite remarkably, the community suffering from this failure named its generally attested reason the “knowledge acquisition bottleneck.” Somehow the logical approach was completely unsuitable for getting in touch with the world, which is not really surprising for anyone who has understood Wittgenstein’s philosophical work, even if only partially. Today, the logic-oriented approach is generally avoided in machine learning.

As a technical aspect, similarity is abundant in the field of so-called data mining. Yet, there it is not discussed as a subject in its own right. In this field, as represented by the respective software tools, rather primitive notions of similarity are employed, importing a lot of questionable assumptions. We will discuss them a bit later.

There is a particular problem with the concept of similarity that endangers many other abstract terms, too. This problem appears if the concept is equated with its operationalization. The sciences and engineering are particularly prone to losing sight of this distinction. It is an inevitable consequence of the self-conception of science, particularly of the hypothetico-deductive approach [cf. 2], to assign ontological weight to concepts. Nevertheless, such an assignment always commits a naturalization fallacy. Additionally, we may suggest that ontology itself is a deep consequence of an overly scientific (say: positivistic, mechanistic) world view. Dropping the positivistic stance removes ontology as a relevant attitude.

As a consequence, science is not able to reflect on the concept itself. What science can do, regardless of the discipline, is just to propose further variants as representatives of a hypothesis, or to classify the various proposed approaches. This poses a serious secondary methodological problem, since it equally holds that there is no science without the transparent usage of the concept of similarity. Science should control the free parameters of experiments and their valuation. Somewhat surprisingly, almost the “opposite” can be observed. The fault is introduced by statistics, as we will see, and this result really came as a surprise even to me.

A special case is provided by “analytical” linguistics, where we can observe a serious case of reduction. In [3], the author selects the title “Vagueness and Linguistics,” but also admits that “In this paper I focused my discussion on relative adjectives.” Well, vagueness can hardly be restricted to anything like relative adjectives, even in linguistics. Even more astonishing is the fact that similarity does not appear as a subject at all in the cited article (except in a reference to another author).

In the field engaged in the theory of metaphor [cf. 4, or 5], one can find a lot of references to similarity. In every case known to me it is, however, regarded as something “elementary” and unproblematic. Obviously neither extra-linguistic modeling nor any kind of inner structure of similarity is recognized as important or even as possible. No particular transparent discourse about similarity and modeling is available from this field.

From these observations it is possible in principle to derive two different and mutually exclusive conclusions. First, we could conclude that similarity is irrelevant for understanding phenomena like language understanding or the empirical constitution. We do not believe that. Second, it could be that similarity represents a blind spot across several communities. Therefore we will try to provide a brief overview of some basic topics regarding the concept of similarity.

#### Etymology

Let us add some etymological considerations for a first impression. Words like “similar,” “simulation” or “same” all derive from the Proto-Indo-European (PIE) base “\*sem-/\*som-”, which meant “together, one”, and then in Old English “same.” Yet, there is also the notion of “simulacrum” in the “same cloud”; the simulacrum is a central issue in the earliest pieces of philosophy that we know in sufficient detail (Plato).

The German word “ähnlich,” being the direct translation of “similar,” derives from Old High German (Althochdeutsch, well before ca. 1050 AD) “anagilith” [6], a compound of an- and gilith, together meaning something like “angleichen,” for which in English we find the words adapt, align, adjust or approximate, but also “conform to” or “blend.” The similarity to “sema” (“sign”) seems to be only superficial; it is believed that sema derives from PIE “dhya” [7].

If some items are said to be “similar,” it is meant that they are not “identical,” where identical means indistinguishable. To make them (virtually) indistinguishable, they would have to be transformed. Even from etymology we can see that similarity needs an activity before it can be attested or assigned. Similarity is nothing to be found “there”; instead it is something that one is going to produce in a purposeful manner. This constructivist aspect is quite important for our following considerations.

#### Common Usage of Similarity

In this section, we will inspect the usage of the concept of “similarity” in some areas of particular relevance. We will visit cognitive psychology, information theory, data mining and statistical modeling.

##### Cognitive Psychology

Let us start with the terminology that has been developed in cognitive psychology, where one can find a richly differentiated concept of similarity. It started with the work of Tversky [8]; more recently, Goldstone has provided a useful overview [9].

Tversky, a highly innovative researcher on cognition, tried to generalize the concept of similarity. His intention was to overcome the typical weakness of “geometric models”, which “[…] represent objects as points in some coordinate space such that the observed dissimilarities between objects correspond to the metric distances between the respective points.”1 The major assumption (and drawback) of geometric models is the (metric) representability in coordinate space. A typical representative of “geometric models,” as Tversky calls them, employs the nowadays widespread Euclidean distance as an operationalization of similarity.

> A new set-theoretical approach to similarity is developed in which objects are represented as collections of features, and similarity is described as a feature-matching process. Specifically, a set of qualitative assumptions is shown to imply the contrast model, which expresses the similarity between objects as a linear combination of the measures of their common and distinctive features.

Tversky’s critique of the “geometrical approach” applies only if two restrictions are in force: (1) missing values are disregarded, which is indeed the case in most practice; and (2) the dimensional interpretation is considered to be stable and unchangeable, so that no folding or warping of the data space via transformation of the measured data is applied.

Yet, it is neither necessary to disregard missing values in a feature-based approach nor to dismiss dimensional warping. Here Tversky does not differentiate between form of representation and the actual rule for establishing the similarity relation. This conflation is quite abundant in many statements about similarity and its operationalization.

What Tversky effectively proposed is now known as binning. His approach is based on features, though in a way quite different (at first sight) from our proposal, as we will show below. Yet, it is not the values of the features that are compared; instead, the two sets are compared on the level of the items by means of a particular ratio function. From a different perspective, the data scale used for assessing similarity is reduced to the ordinal or even the nominal scale. Tversky’s approach is thus prone to destroying information present in the “raw” signal.

An attribute (Tversky’s “feature”) that occurs in different grades or shades is translated into a small set of different, distinct and mutually exclusive features. Tversky obviously does not recognize that binning is just one out of very many possible ways to deal with observational data, i.e. to transform it. Applying a particular transformation based on some theory in a top-down manner is equivalent to the claim that the selected transformation constitutes a perfect filter for the actually given data. Of course, this claim is deeply inadequate (see the chapter about technical aspects of modeling). Any untested, axiomatically imposed algorithmic filter may destroy just those pieces of information that would have been vital to achieve a satisfying model. One simply *can’t* know beforehand.

Tversky’s approach builds on feature sets. The differences between those sets (on a nominal level) are meant to represent the similarity and are expressed by the following formula:

s(a,b) = F(A ∩ B, A − B, B − A)    (eq. 1)

which Tversky describes in the following way:

> The similarity of a to b is expressed as a function F of three arguments: A ∩ B, the features that are common to both a and b; A − B, the features that belong to a but not to b; B − A, the features that belong to b but not to a.

This formula also reflects what he calls “contrast.” (It is similar to Jaccard’s distance; so to speak, an extended practical version of it.) Yet, Tversky, like any other member of the community of cognitive psychologists referring to this or a similar formula, did not recognize that the features, when treated in this way, are all equally weighted. This is a consequence of sticking to set theory. Again, this is just the fallback position of the investigator’s initial ignorance. In the real world, however, features are differentially weighted, building a context. In the chapter about the formalization of the concept of context we propose a more adequate way to think about feature sets, though our concept of context shares important aspects with Tversky’s approach.

Tversky emphasizes that his concept does not consist of just one single instance or formula. He introduces weighting factors for the terms of eq. 1, which then leads to families of similarity functions. To our knowledge this is the only instance (besides ours) arguing for a manifold regarding similarity. Yet, again, Tversky still does not draw the conclusion that the chosen instance of a similarity “functional” (see below) has to be conceived as just a hypothesis.
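To make the idea of a family of similarity functions more tangible, here is a minimal sketch of the two forms usually associated with Tversky [8]: the contrast model, θ·f(A ∩ B) − α·f(A − B) − β·f(B − A), and the ratio model, which reduces to the Jaccard index for α = β = 1. The function names, the `salience` argument (standing in for Tversky’s f), the example features and the weights are our own illustrative assumptions, not taken from the article.

```python
def contrast_similarity(a, b, theta=1.0, alpha=1.0, beta=1.0, salience=len):
    """Contrast model: theta*f(A & B) - alpha*f(A - B) - beta*f(B - A)."""
    a, b = set(a), set(b)
    return (theta * salience(a & b)
            - alpha * salience(a - b)
            - beta * salience(b - a))


def ratio_similarity(a, b, alpha=1.0, beta=1.0, salience=len):
    """Ratio model; with alpha = beta = 1 and salience = len it is the Jaccard index."""
    a, b = set(a), set(b)
    common = salience(a & b)
    denom = common + alpha * salience(a - b) + beta * salience(b - a)
    # Convention chosen for this sketch: two empty feature sets count as identical.
    return common / denom if denom else 1.0


# Hypothetical feature sets, for illustration only.
bird = {"wings", "feathers", "beak", "flies"}
penguin = {"wings", "feathers", "beak", "swims"}

# Hypothetical feature weights: "flies" counts three times as much as any other feature.
weights = {"flies": 3.0}


def weighted(features):
    """A salience function f that weights features differentially."""
    return sum(weights.get(f, 1.0) for f in features)


print(contrast_similarity(bird, penguin))                    # 3 - 1 - 1
print(ratio_similarity(bird, penguin))                       # 3 / (3 + 1 + 1)
print(ratio_similarity(bird, penguin, alpha=1.0, beta=0.0))  # asymmetric variant
print(ratio_similarity(bird, penguin, salience=weighted))    # differentially weighted
```

Each choice of θ, α, β and of the salience function yields a different member of the family. Nothing in the formula itself tells us which member is adequate for a given set of observations.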

In cognitive psychology (even today), the term “feature-based models” of similarity does *not* refer to feature vectors as they are used in data mining, or even to generalized vectors of assignates, as we proposed in our concept of the generalized model. In Tversky’s article this becomes manifest on p.330. Contemporary psychologists like Goldstone [9] distinguish four different ways of operationalizing similarity: (1) geometric, (2) feature-based, (3) alignment-based, and (4) transformational similarity. For Tversky [8] and Goldstone, the label “geometric model” refers to models based on feature vectors, as they are used in data mining, e.g. with the Euclidean distance.

Our impression is that cognitive psychologists fail to think about features and similarity in a sufficiently abstract manner. Additionally, there seems to be a tendency towards the representationalist fallacy. Features are only recognized as features insofar as they appear “attached” to the object for human senses. Once this attitude is dropped, it becomes an easy exercise to subsume all four types in a feature-vector approach which (1) allows for missing values and assigns them a “cost”, and which (2) is not limited to primitive distance functions like the Euclidean or Hamming distance. The underdeveloped generality is especially visible concerning the alignment and transformational subtypes of similarity.
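As a rough illustration of these two requirements, the following sketch compares two profiles while charging a fixed cost for every missing value and leaving the per-feature comparison exchangeable. The function name, the use of `None` as the marker for a missing value, and the default cost of 0.5 are assumptions made for this example only.

```python
import math


def profile_distance(x, y, missing_cost=0.5,
                     component=lambda u, v: (u - v) ** 2):
    """Root-mean dissimilarity of two profiles; None marks a missing value.

    `component` is the exchangeable per-feature comparison (squared
    difference by default); `missing_cost` is charged whenever at least
    one of the two values is missing, instead of dropping the feature.
    """
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and share the same features")
    total = 0.0
    for u, v in zip(x, y):
        if u is None or v is None:
            total += missing_cost
        else:
            total += component(u, v)
    return math.sqrt(total / len(x))


# Hypothetical normalized profiles over three features.
a = [0.2, None, 0.9]
b = [0.2, 0.4, 0.5]
print(profile_distance(a, b))
# The same comparison with an absolute-difference component instead.
print(profile_distance(a, b, component=lambda u, v: abs(u - v)))
```

The point of the design is that neither the treatment of missing values nor the per-feature rule is fixed by the representation as a vector; both remain parameters.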

A further gap in the similarity theory in cognitive psychology is the missing separation between the operation of comparison and the operationalization of similarity as a projective transformation into a 0-dimensional space, that is a scalar (a single value). This distinction is vital, in our opinion, to understand the characteristics of comparison. If one does not separate similarity from comparison, it becomes impossible to become aware of higher forms of comparison.

##### Information theory

A much more profound generalization of similarity, at least at first sight, has been proposed by Dekang Lin [10], which is based on an “information-theoretic definition of similarity that is applicable as long as there is a probabilistic model.” The main advantage of this approach is its wide applicability, even in cases where only coarse frequency data are available. Quite unfortunately, Lin’s proposal neglects a lot of information if there are accurate measurements in the form of feature-vectors. Besides the comparison of strings and statistical frequency distributions, Lin’s approach is applicable to sets of features, but not to profile-based data, as we propose for our generalized model.
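For feature sets, Lin’s measure is commonly given as the ratio between the information contained in the shared features and the information contained in the two descriptions, sim(A, B) = 2·I(A ∩ B) / (I(A) + I(B)), with I(S) = −Σ log P(f) under an independence assumption. The following sketch follows that reading; the feature probabilities are invented for illustration, and the helper names are ours.

```python
import math


def information(features, prob):
    """I(S) = -sum(log P(f)) for independently distributed features."""
    return -sum(math.log(prob[f]) for f in features)


def lin_similarity(a, b, prob):
    """Shared information relative to the information in both descriptions."""
    a, b = set(a), set(b)
    denom = information(a, prob) + information(b, prob)
    # Convention for this sketch: descriptions carrying no information count as identical.
    return 2.0 * information(a & b, prob) / denom if denom else 1.0


# Hypothetical feature probabilities: rare features carry more information.
prob = {"wings": 0.5, "beak": 0.5, "flies": 0.25, "swims": 0.25}

x = {"wings", "beak", "flies"}
y = {"wings", "beak", "swims"}
print(lin_similarity(x, y, prob))
```

Note that only the presence or absence of features and their frequencies enter the calculation; graded measurements per feature have no place in it.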

##### Data Mining

Data mining is a distinguished set of tools and methods that are employed in a well-organized manner in order to facilitate the extraction of relevant patterns [11], either for predictive or for diagnostic purposes. Data mining (DM) is often conceived as a part of so-called “knowledge discovery,” giving rise to the famous abbreviation KDD: knowledge discovery in databases [11]. In our opinion, the term “data mining” is highly misleading, and “knowledge discovery” even deceptive. In contrast to earthly mining, in the case of information the valuable objects are not “out there” like minerals or gems, while knowledge can’t be “discovered” like diamonds or physical laws. Even the “retrieval” of information is impossible in principle. To think otherwise dismisses the need for interpretation and hence contradicts widely acknowledged positions in contemporary epistemology. One has to know that the terms “KDD” and “data mining” are shallow marketing terms, coined to release the dollars of naive customers. Yet, KDD and DM are myths that many people believe in and that are reproduced in countless publications. As concepts, they simply remain utter nonsense. As a nonsensical practice that is deeply informed by positivism, it is harmful for society. It is more appropriate, and more down-to-earth, to call the respective activity just *diagnostic* or *predictive modeling* (which actually are equivalent).

Any observation of entities takes place along a priori selected properties, often physical ones. This selection of properties is part of the process of creating an operationalization, which means making a concept operable by making it measurable. Those properties are not “natural properties of objects.” Quite the contrary, objecthood is created by the assignment of a set of features. This inversion is often overlooked in data mining projects, and consequently so is the eminently constructive character of data-based modeling. Hence, it is likewise not correct to call it “data analysis”: an analysis does not add anything. Predictive/diagnostic models are constructed and synthesized like small machines. Models may well be conceived as informational machinery. To make our point clear: nobody among the large community of machine-building engineers would support the view that any machine comes into existence just through “analysis.”

Given the importance of similarity in comparison, it is striking to see that in many books about data mining the notion of “similarity” does not appear even a single time [e.g. 12], and in many more publications only in a very superficial manner. Usually, it is believed that the Euclidean distance is a sound, sufficient and appropriate operationalization of similarity. Given its abundance, we have to take a closer look at this concept, how it works, and how it fails.

##### Euclidean Distance and its Failure

We already met the idea that objects are represented along a set of selected features. In the chapter about comparison we saw that in order to compare items of a population of objects, those objects are to be compared on the basis of a selected and shared feature set. Next, it is clear that for each of the features some values can be measured. For instance, presence could be indicated by the pair of values 1/0. For nominal values like names, re-scaling mechanisms have been proposed [13]. In this way, any observation can be transformed into a table of values, where the (horizontal) rows represent the objects and the columns describe the features.

We also can say that any of the objects contained in such a table is represented by a profile. Note that the order of the columns (features) is arbitrary, but it is also constant for all of the objects covered by the table.

The idea now is that each of the columns represents a dimension in a Cartesian, orthogonal coordinate system. As a preparatory step, we normalize the data, i.e. for each single column the values contained in it are scaled such that the ratios remain unchanged, but the absolute values are projected into the interval [0..1].

By means of such a representation any of the objects (=data rows) can be conceived as a particular point in the space spanned by the coordinate system. The similarity S is then operationalized as the “inverse” of the distance, S=1-d, between any two of the points. The distance can be calculated according to the Euclidean formula for the length of the hypotenuse of a right triangle (2d case). In this way, the points are understood as the endpoints of vectors that start in the origin of the coordinate system. Thus, this space is often called “data space” or “vector space.” The distance is called “Euclidean distance.”
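The procedure just described can be sketched in a few lines. Two details are our own assumptions rather than part of the description: the columns are scaled by their maximum (which keeps ratios intact, assuming non-negative values), and the distance is divided by √k before being subtracted from 1, because in k dimensions the raw Euclidean distance between two normalized profiles can reach √k and S = 1 − d would otherwise become negative.

```python
import math


def normalize_columns(table):
    """Scale every column by its maximum: ratios are kept, values fall in [0, 1].

    Assumes non-negative values; an all-zero column is mapped to zeros.
    """
    maxima = [max(column) for column in zip(*table)]
    return [[value / m if m else 0.0 for value, m in zip(row, maxima)]
            for row in table]


def euclidean_similarity(x, y):
    """S = 1 - d / sqrt(k), where d is the Euclidean distance of two profiles."""
    d = math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)))
    return 1.0 - d / math.sqrt(len(x))


# Hypothetical table: two objects (rows) described by two features (columns).
table = [[2.0, 10.0],
         [4.0, 5.0]]
normalized = normalize_columns(table)
print(normalized)
print(euclidean_similarity(normalized[0], normalized[1]))
```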

Since all of the vectors lie within the unit hypercube (any value is in [0..1]), there is another possibility for an operationalization of the similarity. Instead of the distance one could take the angle between any two of those vectors. This yields the so-called cosine measure of (dis-)similarity.
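A minimal sketch of the cosine measure, assuming plain lists of numbers as profiles (the function name is ours):

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two profile vectors."""
    dot = sum(u * v for u, v in zip(x, y))
    norm_x = math.sqrt(sum(u * u for u in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("the cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal profiles
print(cosine_similarity([0.2, 0.4], [0.4, 0.8]))  # same direction, different length
```

The measure depends only on the direction of the vectors from the origin, not on their length. Like the Euclidean distance, it treats every column as an independent axis.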

Besides the fact that missing values are often (and wrongly) excluded from a feature-vector-based comparison, this whole procedure has a serious built-in flaw, whether as cosine- or as Euclidean distance.

Figure 1a below shows the profiles of two objects across a set of assignates (aka attributes, features, properties, fields). The embedding coordinate space has k dimensions. One can see that the black profile (representing object/case A) and the red profile (representing object/case B) are fairly similar. Note that within the method of the Euclidean distance all ai are supposed to be independent of each other.

**Figure 1a:** Two objects A’ and B’ have been represented as profiles A, B across a shared feature vector ai of size k.

Next, we introduce a third profile, representing object C. Suppose that the correlation between profiles A and C is almost perfect. This means that the inner structure of objects A and C could be considered to be very similar. Some additional factor might just have damped the signal, such that all values are lower by an almost constant ratio when compared to the values measured from object A.

**Figure 1b:** Compared to figure 1a, a third object C’ is introduced as a profile C; this profile causes a conflict about the order that should be induced by the similarity measure. There are (very) good reasons, from systems theory as well as from information theory, to consider A and C more similar to each other than either A-B or B-C. Nevertheless, employing Euclidean distance will lead to a different result, rating the pairing A-B as the most similar one.

The particular difficulty now is that which pair of observations is considered the most similar depends on considerations that lie completely outside of the chosen operationalization of similarity. Yet, this dependency inverts the intended arrangement of the empirical setup. The determination of the similarity should actually be used to decide about those outside considerations. Given the Euclidean distance, A and B are clearly much more similar to each other than either A-C or B-C. Using a correlative measure, in contrast, would select A-C as the most similar pairing. This effect gets more and more serious the more assignates are used to compare the items.
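The conflict can be reproduced with three small profiles. The numbers below are invented for illustration and are not the data behind figures 1a and 1b: B deviates slightly and irregularly from A, while C is A damped by a constant factor of 0.5. Pearson’s correlation coefficient stands in for “a correlative measure.”

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)))


def pearson(x, y):
    """Pearson correlation of two profiles (undefined for constant profiles)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    spread_x = math.sqrt(sum((u - mean_x) ** 2 for u in x))
    spread_y = math.sqrt(sum((v - mean_y) ** 2 for v in y))
    return cov / (spread_x * spread_y)


# Hypothetical normalized profiles over six assignates.
profiles = {
    "A": [0.90, 0.30, 0.80, 0.20, 0.90, 0.30],
    "B": [0.80, 0.45, 0.90, 0.30, 0.75, 0.20],  # close to A, slightly distorted
    "C": [0.45, 0.15, 0.40, 0.10, 0.45, 0.15],  # A damped by the factor 0.5
}

pairs = list(combinations(sorted(profiles), 2))

closest_by_distance = min(
    pairs, key=lambda p: euclidean(profiles[p[0]], profiles[p[1]]))
closest_by_correlation = max(
    pairs, key=lambda p: pearson(profiles[p[0]], profiles[p[1]]))

for first, second in pairs:
    print(first, second,
          round(euclidean(profiles[first], profiles[second]), 3),
          round(pearson(profiles[first], profiles[second]), 3))

print("most similar by Euclidean distance:", closest_by_distance)
print("most similar by correlation:", closest_by_correlation)
```

With these values the Euclidean distance selects the pairing A-B, whereas the correlation selects A-C. The data are the same in both cases; only the operationalization of similarity has changed.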

Now imagine that there are many observations (dozens, hundreds, or hundreds of thousands) that serve as a basis for deriving an intensional description of all observations. It is quite obvious that the final conclusions will differ drastically depending on the selection of the similarity measure. The choice of the similarity measure is by no means only of technical interest. The particular problem, but also, as we will see, the particular opportunity related to the operationalization of similarity consists in the fact that there is a quite short and rather strong link between a technical aspect and the semantic effect.

Yet, there are measures that reflect the similarity of the form of the whole set of items more appropriately, such as least-squares distances, or measures based on correlation, like the Mahalanobis distance. However, these analytic measures have the disadvantage of relying on certain global parametric assumptions, such as a normal distribution. And they do not completely resolve the situation shown in figure 1b, even in theory.

We just mentioned that the coherence of value items may be regarded as a form. Thus, it is quite natural to use a similarity measure that is derived from geometry or topology and that does not suffer from any particular analytic a priori assumption. One such measure is the Hausdorff metric, or more generally the Gromov-Hausdorff metric. Having been developed in geometry, they find their “natural” application in image analysis, such as the partial matching of patterns to larger images (aka finding “objects” in images). For the comparison of profiles we have to interpret them as figures in a 2-dimensional space, with |ai|/2 coordinate points. Two such figures are then ready to be compared. The Hausdorff distance is also quite interesting because it allows us to compare whole sets of observations, not only as two paired observations (profiles) interpreted as coordinates in ℝ², but also three observations as ℝ³, or a whole set of n observations, arranged as a table, as a point cloud in ℝⁿ. Assuming compactness, i.e. a macroscopic world without gaps, we may also interpret them as curves. This allows us to compare whole subsets of observations *at once*, which is a quite attractive feature for the analysis of relational data. As far as we know, nobody has ever used the Hausdorff metric in this way.
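For finite point sets the Hausdorff distance is straightforward to compute: it is the largest distance from any point of one set to its nearest neighbor in the other set, taken in both directions. The sketch below assumes one particular reading of the interpretation as a figure, namely that consecutive pairs of values of a profile form the coordinates of one point in the plane; this pairing, like the function names, is our assumption for the example.

```python
import math


def as_points(profile):
    """Read a profile of even length k as k/2 points in the plane."""
    if len(profile) % 2:
        raise ValueError("the profile needs an even number of values")
    return list(zip(profile[0::2], profile[1::2]))


def directed_hausdorff(p, q):
    """Largest distance from a point of p to its nearest point in q."""
    return max(min(math.dist(a, b) for b in q) for a in p)


def hausdorff(p, q):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(p, q), directed_hausdorff(q, p))


# Hypothetical profiles; the second contains one point far from the first figure.
figure_p = as_points([0.0, 0.0, 1.0, 0.0])
figure_q = as_points([0.0, 0.0, 1.0, 0.0, 1.0, 2.0])

print(directed_hausdorff(figure_p, figure_q))  # every point of p is matched in q
print(hausdorff(figure_p, figure_q))           # the extra point of q dominates
```

Since the points may have any dimension, the same functions apply to point clouds built from whole tables of observations. The two directed distances differ in general, which is why the symmetric version takes their maximum.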

Epistemologically, it is interesting that a topologically inspired assessment of data provides a formal link between feature-based observations and image processing. Maybe, this is relevant for the subjective impression to think in “images,” though nobody has ever been able to “draw” such an image… This way, the idea of “form” in thought could acquire a significant meaning.

Yet, already in his article published more than 30 years ago, Tversky [8] mentioned that the metric approach is barely convincing. He writes (p.329)

> The applicability of the dimensional assumption is limited, […] minimality is somewhat problematic, symmetry is apparently false, and the triangle inequality is hardly compelling.

It is of utmost importance to understand that the selection of the similarity measure as well as the selection of the features used to calculate it are by far the most important factors in the determination, or better the hypothetical presupposition, of the similarity between the profiles (objects) to be compared. The similarity measure and the feature selection are far more important than the selection of a particular method, i.e. a particular way of organizing the application of the similarity measure. Saying “more important” also means that the differences in the results are much larger between different similarity measures than between methods. From a methodological point of view it is thus quite important that the similarity measure is “accessible” and not buried in a “heap of formulas.”

Similarity measures that are based only on dimensional interpretation and coordinate spaces are not able to represent issues of form and differential relations, which is also (and better) known as “correlation.” Of course, other approaches different from correlation that reflect the form aspect of the internal relations of a set of variables (features) would do the job, too. We just want to emphasize that the assumption of perfect independence among the variables is “silly” in the sense that it contradicts the “game” that the modeler actually pretends to play. This leads more often than not to irrelevant results. The serious aspect of this, however, is that the deficiency remains invisible when comparing results between different models built according to the Euclidean dogma.

There is only one feasible conclusion from this: similarity can’t be regarded as a property of actual pairings of objects. The similarity measure is a free parameter in modeling, that is, nothing other than a hypothesis, though on the structural level. As a hypothesis, however, it needs to be tested for adequacy.
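One simple way to treat the similarity measure as a testable hypothesis is to pass it as a parameter into a fixed modeling procedure and to compare the outcomes. The sketch below uses leave-one-out validation of a nearest-neighbor assignment as that procedure; the procedure, the function names and the four labeled profiles are assumptions chosen for brevity, and the data are deliberately constructed so that the label depends on the shape of the profile rather than on its level.

```python
import math


def negative_euclidean(x, y):
    """Similarity as negated Euclidean distance (larger means more similar)."""
    return -math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)))


def pearson(x, y):
    """Pearson correlation of two profiles (undefined for constant profiles)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    spread_x = math.sqrt(sum((u - mean_x) ** 2 for u in x))
    spread_y = math.sqrt(sum((v - mean_y) ** 2 for v in y))
    return cov / (spread_x * spread_y)


def leave_one_out_accuracy(cases, similarity):
    """Share of cases whose most similar other case carries the same label."""
    hits = 0
    for i, (profile, label) in enumerate(cases):
        others = cases[:i] + cases[i + 1:]
        _, predicted = max(others, key=lambda case: similarity(profile, case[0]))
        hits += predicted == label
    return hits / len(cases)


# Hypothetical labeled profiles: two shapes, each observed at two signal levels.
cases = [
    ([0.1, 0.2, 0.3, 0.4], "rising"),
    ([0.2, 0.4, 0.6, 0.8], "rising"),
    ([0.4, 0.3, 0.2, 0.1], "falling"),
    ([0.8, 0.6, 0.4, 0.2], "falling"),
]

for name, measure in [("euclidean", negative_euclidean), ("pearson", pearson)]:
    print(name, leave_one_out_accuracy(cases, measure))
```

On these constructed data the correlative measure assigns every case correctly, while the Euclidean measure fails on half of them. On data where the level of the signal carries the relevant information the ranking could well be reversed, which is precisely why the measure has to be tested instead of being presupposed.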

##### Similarity in Statistical Modeling

In statistical modeling the situation is even worse. Usually, the subject of statistical modeling is not the individual object or its representation. The reasoning in statistical modeling differs strongly from that in predictive modeling. Statistics compares populations, or at least groups as estimates of populations. Depending on the scale of the data, the amount of data, the characteristics of the data and the intended argument, a specialized method has to be picked from a large variety of potential methods. Often, the selected method also has to be parameterized. As a result, the whole process of creating a statistical model is more “art” than science. Results of statistical “analysis” are only approximately reproducible across analysts. It is indeed somewhat ironic that at the heart of quantitative science one finds a non-scientific methodological core.

Anyway, our concern is similarity. In statistical modeling there is no similarity function visible at all. All that one can see is the result, and statements like “population B is not part of population A with a probability for being false positive of 3%.” Yet, the final argument that populations can (or can’t) be discerned is obviously also an argument about the probability of a correct assignment of the members of the compared populations. Hence, it is also clearly an argument about the group-wise as well as the individual similarity of the objects. The really bad thing is that the similarity function is barely visible at all. Often it is some kind of simple difference between values. The main point is that it is not possible to parametrize the hidden similarity function, except by choosing the alpha level for the test. It is “distributed” across the whole procedure of the respective method. In its most important aspect, any of the statistical methods has to be regarded as a black box.

These problems with statistical modeling are prevalent across the general framework, i.e. regardless of whether one chooses a frequentist or a Bayesian attitude. Recently, Alan Hájek [14] showed that statistics is a framework that in all its flavors suffers from the reference class problem. Cheng [15] correctly notes about the reference class problem that

“At its core, it observes that statistical inferences depend critically on how people, events, or things are classified. As there is (purportedly) no principle for privileging certain categories over others, statistics become manipulable, undermining the very objectivity and certainty that make statistical evidence valuable and attractive …”

So we can see that the reference class problem is just a corollary of the fact that the similarity function is not given explicitly and hence also is not accessible. Thus, Cheng seeks unfulfillable salvation by invoking the cause of defect itself: statistics. He writes

> I propose a practical solution to the reference class problem by drawing on model selection theory from the statistics literature.

Although he is right in pointing to the necessity of model selection, he fails to recognize that statistics can’t be helpful in this task. We find it interesting that this author (Cheng) has been writing for the community of legal theorists. This sheds bright light on the relevance of an appropriate theory of modeling.

As a consequence we conclude that statistical methods should not be used as the main tool for any diagnostic/predictive modeling of real-world data. The role of statistical methods in predictive/diagnostic modeling is just the same as that of any other transformation: they are biased filters, whose adequacy has to be tested, nothing less, and, above all, definitely nothing more. Statistics should be used only within completely controllable, hence completely closed environments, such as simulations or “data experiments.”

#### The Generalized View

Before we start, we would like to recall the almost trivial aspect that the concept of similarity makes sense exclusively in the context of diagnostic/predictive modeling, where “modeling” refers to the generalized model, which in turn is part of a transcendental structure.

After having briefly discussed the relation of the concept of similarity to some major domains of research, we now may turn to the construction/description of a proper concept of similarity. The generalized view that we are going to argue for should help determining the appropriate mode of speaking about similarity.

##### Identity

Identity is often seen as the counterpart of similarity, or also as some kind of simple asymptotic limit to it. Yet, these two concepts are so deeply incommensurable that they cannot be related at all.

One could suggest that identity is a relation that indicates a particular result of a comparison, namely indistinguishability. We then also could say that under any possible transformation applied to identical items the respective items remain indistinguishable. Yet, if we compare two items we already refer to the concept of similarity, from which we wanted to distinguish identity. Thus it is clear that identity and similarity are structurally different. There is no way from one to the other.

In other words, the language game of identity excludes any possibility for a comparison. We can set it only by definition, axiomatically. This means that not only can the concepts not be related to each other; we also see that the subjects of the two concepts are categorically different. Identity is only meaningful as an axiomatically introduced equality of symbols.

In still other words we could say that identity is restricted to axiomatically defined symbols in formal surrounds, while similarity is applicable only in empirical contexts. Similarity is not about symbols, but about measurement and the objects constructed from it.

This has profound consequences.

First, identity can’t be regarded as a kind of limit to which similarity would asymptotically approximate. For any two objects that have been rated as being “equal,” notably through some sort of comparison, it is thus possible to find a perspective under which they are not equal any more.

Second, it is impossible to take an existential stance towards similarity. Similarity is the result of an action, of a method or technique that is embedded in a community. Hence it is not possible to assign similarity an ontic dimension. Similarity is not part of any possible ontology.

We can’t ask “What is similarity?”, and we cannot even pretend to determine “the” similarity of two subjects. “Similarity” is a very particular language game, much like its close relatives, such as vagueness. We only can ask “How to speak about similarity?”

Third, it is now clear that there is no easy way from a probabilistic description to a propositional reference. We already introduced this in another chapter, and we will deal with it in detail elsewhere. There is no such transition within a single methodology. We just see again how far Wittgenstein’s conclusion about the relation of the world and logic is reaching. The categorical separation between identity and similarity, or between the empirical and the logical, can’t be overestimated. For our endeavor of a machine-based epistemology it is of vital interest to find a sound theory for this transition, which has not even been recognized as a problematic subject in any of the relevant research areas so far.

##### Practical Aspects

Above we have seen that any particular similarity measure should be conceived as part of a general hypothesis about the best way to create an optimized model. Within such a hypothesizing setting we can distinguish two domains embedded into a general notion of similarity. We could align these two modes to the distinction Peirce introduced with regard to uncertainty: probability and verisimilitude [16]. Thus, the first domain regards the variation along assignates that are shared between two items. From the perspective of any of the compared items, there is complete information about the extension of the world of that item. Any matter is a matter of degree and probability, as Peirce understood it. Taking the perspective of Deleuze, we could call it also possibility.

The second domain is quite different. It is concerned with the difference of the structure of the world as it is accessible for each of the compared items, where this difference is represented by a partial non-matching of the assignates that provide the space for measurement. Here we meet Tversky’s differential ratio that builds upon differences in the set of assignates (“features,” as he called them) and can be used also to express differential commonality.

Yet, the two domains are not separated from each other in an absolute manner. The logarithm is a rather simple analytic function with some unique properties. For instance, it is not defined for argument values in [-∞..0]. The zero (“0”), however, in turn can be taken to serve as a double articulation that allows us to express two very different things: (1) through linear normalization the lowest value of the range, and (2) the (still symbolic) absence of a property. Using the logarithm then, the value “0” gets transformed into a missing value, because the logarithm is not defined for arg=0, that is, we turn the symbolic into a quasi-physical absence. The *extremely* (!) valuable consequence of this is that by means of the logarithmic transformation we can change the feature vector on the fly in a context dependent manner, where “context” (i) can denote any relation between variables or values therein, and (ii) may be related to certain segments of observations. Even the (extensional) items within an (intensional) class or empirical category may be described by dynamically regulated sets of assignates (features). In other words, the logarithmic transformation provides a plain way towards abstraction. Classes as clusters do not just comprise items homogenized by the identical feature set. Hence, it is a very powerful means in predictive modeling.
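The effect of the logarithmic transformation on the feature vector may be sketched in a few lines of Python. This is merely an illustration under our own assumptions, not a reference implementation: the names `log_assignates` and `shared_distance` are hypothetical, and the mean absolute difference over the shared assignates is just one arbitrary choice for the comparison step.

```python
import math

def log_assignates(values):
    """Log-transform a vector of linearly normalized assignates.

    Values <= 0 have no logarithm, so they become None (a missing value):
    the symbolic zero is turned into an absent assignate.
    """
    return [math.log(v) if v is not None and v > 0 else None for v in values]

def shared_distance(a, b):
    """Mean absolute difference over the assignates present in BOTH items.

    Returns the distance and the indices that were actually compared.
    (The distance formula is an assumption made for this sketch only.)
    """
    shared = [i for i, (x, y) in enumerate(zip(a, b))
              if x is not None and y is not None]
    if not shared:
        return None, shared
    dist = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return dist, shared

# Two items measured on four normalized assignates; each has one zero.
item_a = [0.5, 0.0, 1.0, 0.25]
item_b = [1.0, 0.8, 0.0, 0.25]

log_a = log_assignates(item_a)   # second assignate drops out for item_a
log_b = log_assignates(item_b)   # third assignate drops out for item_b
dist, used = shared_distance(log_a, log_b)
```

Here only the first and the fourth assignate enter the comparison, since each item has lost one assignate through the transformation. The set of assignates that is effectively compared thus depends on the data, not on a fixed feature list.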

Given the two domains in the practical aspects of similarity measures it is now becoming more clear that we indeed need to insist on a separation of assignates and the mapping similarity function, as we did in the chapter about comparison. We reproduce Figure 2b from that chapter:

Figure 2: Schematic representation of the comparison of two items. Items are compared along sets of “attributes,” which have to be assigned to the items, indicated by the symbols {a} and {b}.

The set of assignates symbolized as {a} or {b} for items A, B doesn’t comprise just the “observable” raw “properties.” Of course, all those properties are selected and assigned by the observer, which results in the fact that the observed items are literally established only through this measurement step. Additionally, the assigned attributes, or better “assignates,” comprise also all transformations of raw, primary assignates, building two extended sets of assignates. The similarity function then imposes a rule for calculating the scalar (a single value) that finally serves as a representation of the respective operationalization. This function may represent any kind of mapping between the extended sets of assignates.

Such a mapping (function) could consist of a compound of weighted partial functions, according to the proposals of Tversky or Cheng and a particular profile mapping. Sets {a} and {b} need not be equal, of course. One could even apply the concept of formalized contexts instead of a set of equally weighted items. Nevertheless, there remains the apriori of the selection of the assignates, that precedes the calculation of the scalar. In practical modeling this selection will almost for sure lead to a removal of most of the measured “features.”
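As an illustration of one such partial function, Tversky’s ratio model [8] can be written down for two unequal sets of assignates. The sketch covers only the Tversky part, not Cheng’s proposal or the profile mapping. The function name `tversky_ratio`, the example assignates and the use of the plain set size as salience function are our assumptions for this example.

```python
def tversky_ratio(a, b, alpha=0.5, beta=0.5, weight=len):
    """Tversky's ratio model on two sets of assignates.

    S(a, b) = f(a & b) / (f(a & b) + alpha * f(a - b) + beta * f(b - a))

    `weight` plays the role of the salience function f; using the set
    size (len) is an assumption made for this sketch.
    """
    common = weight(a & b)
    only_a = weight(a - b)
    only_b = weight(b - a)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom > 0 else 0.0

# Hypothetical assignates; the two sets are deliberately not equal.
a = {"wings", "feathers", "beak", "flies", "sings"}
b = {"wings", "feathers", "beak"}

s_ab = tversky_ratio(a, b, alpha=1.0, beta=0.0)    # 3 / (3 + 2) = 0.6
s_ba = tversky_ratio(b, a, alpha=1.0, beta=0.0)    # 3 / (3 + 0) = 1.0
jaccard = tversky_ratio(a, b, alpha=1.0, beta=1.0) # 3 / (3 + 2 + 0) = 0.6
```

With unequal weights for the two difference sets the measure is not symmetric: comparing a to b yields a different scalar than comparing b to a. The choice of alpha and beta is thus part of the hypothesis, not a property of the items.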

Above we said that any similarity measure must be considered as a free parameter in modeling, that is, as nothing else than a hypothesis. For the sake of abstraction and formalization this requires that we generalize the single similarity function into a family of functions, which we call “functional.” In category theoretic terms we could call it also a “functor.” The functor of all similarity functions then would be part of the functor representing the generalized model.
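What it means to treat the similarity measure as a free parameter can be sketched as follows. A small generator returns members of a family of similarity functions, and the members are then rated against a purpose, represented here by a target variable. Everything in this sketch is an assumption chosen for brevity: the Minkowski family, the nearest-neighbour rule, the leave-one-out rating and the toy data.

```python
def minkowski_similarity(p):
    """Return one member of a family of similarity functions, indexed by p."""
    def similarity(x, y):
        dist = sum(abs(u - v) ** p for u, v in zip(x, y)) ** (1.0 / p)
        return 1.0 / (1.0 + dist)
    return similarity

def loo_accuracy(similarity, items, targets):
    """Leave-one-out accuracy of a nearest-neighbour rule using `similarity`."""
    hits = 0
    for i, x in enumerate(items):
        # The most similar other item lends its target value to item i.
        best = max((j for j in range(len(items)) if j != i),
                   key=lambda j: similarity(x, items[j]))
        hits += targets[best] == targets[i]
    return hits / len(items)

# Toy observations (two assignates each) and a binary target variable.
items = [(0.0, 0.0), (3.0, 0.2), (4.0, -0.5), (2.0, 2.0), (2.0, 3.5)]
targets = [0, 0, 0, 1, 1]

# Each member of the family is a hypothesis to be tested against the purpose.
family = {p: minkowski_similarity(p) for p in (1, 2, 4)}
scores = {p: loo_accuracy(f, items, targets) for p, f in family.items()}
best_p = max(scores, key=scores.get)
```

With these toy data the member p = 1 assigns all five items correctly, while p = 2 and p = 4 misassign the first item. None of the members is “the” similarity; which one is adequate depends on the target variable and on the data.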

##### Formal Aspects

In the chapter about the category of models we argued that models can not be conceived in a set theoretic framework. Instead, we propose to describe models and the relations among them on the level of categories, namely the category of functors. In plain words, models are *transformative relations*, or in a term from category theory, arrows. Similarity is a dominant part of those transformative relations.

Against this background (or: within this framing), we could say that similarity is a property of the arrow, while a particular similarity function represents a particular transformation. By expressing similarity as a value we effectively map these properties of the arrow to a scalar, which could be a touchable value or which could be an abstract scalar. Even more condensed, we could say that in a general perspective:

*Similarity can be conceived as a mapping of relations onto a scalar.*

This scalar should not be misunderstood as the value of the categorical “arrow.” Arrows in category theory are not vectors in a coordinate system. The assessment of similarity thus can’t be taken just as kind of a simple arithmetic transformation. As we already said above from a different perspective, similarity is not a property of objects.

Since similarity makes sense only in the context of comparing, hence in the context of modeling, we also can recognize that the value of this scalar is dependent on the purpose and its operationalization, the target variable. Similarity is nothing which could be measured. It is entirely the result of an intention.

##### Similarity and the Symbolic

It is more appropriate to understand similarity as the actualization of a potential. Since the formal result of this actualization is a scalar, i.e. a primitive with only a simple structure, this actualization prepares also the ground for the possibility of a new symbolization. The similarity scalar is able to take three quite different roles. First, it can act as a criterion to impose a differential semi-order under ceteris paribus conditions for modeling. Actual modeling may be strongly dominated by arbitrary, but nevertheless stable habits. Second, the similarity scalar also could be taken as an ultimate “abbreviation” of a complex activity. Third, and finally, the scalar may well appear as a quasi-material entity due to the fact that there is so little inner structure to it.

*It is the “Similarity-Game” that serves as a ground for hatching symbols.*

Imagine playing this game according to the Euclidean rules. Nobody could expect rich or interesting results, of course. The same holds if “similarity” is misunderstood as a technical issue, which could be represented or determined as a closed formalism.2

It is clear that these results are also quite important to understand the working of metaphors in practiced language. Actually, we think that there is no other mode of speaking in “natural,” i.e. practiced languages than the metaphorical mode. The understanding of similarity as a ground for hatching symbols directly leads to the conclusion that words and arrangements of words do not “represent” or “refer to” something. Even more concisely we may say that neither things nor signs or symbols are able to carry references. Everything is created in the mind. Yet, and still refuting radical constructivism, we suggest that the tools for this creative work are all taken from the public.

#### Conclusions

As usual, we finally address the question about the relevance of our achieved results for the topic of machine-based epistemology.

From a technical perspective, the most salient insight is probably the relativity of similarity. This relativity renders similarity into a strictly non-ontological concept (we anyway think that the idea of “pure” ontology is based on a great misunderstanding). Despite the fact that it is pretended thousands of times each day that “the” similarity has been calculated, such “calculation” is not possible. The reason for this is simply that (1) it is not just a calculation like, for instance, the calculation of the Easter date, and (2) there is nothing like “the” similarity that could be calculated.

In any implementation that provides means for the comparison of items we have to care for an appropriate generality. Similarity should never be implemented as a formula, but instead as a (templated or abstract) object. Another (not only) technical aspect concerns the increasing importance of the “form-factor” when comparing profiles the more assignates are used to compare the items. This should be respected in any implementation of a similarity measure by increasing the weight of such “correlational” aspects.
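A minimal sketch of such an implementation in Python could look as follows. It is an illustration under our own assumptions, not a prescription: the class names are hypothetical, the form-factor is represented by a rescaled Pearson correlation of the two profiles, and the rule `n / (n + scale)` for increasing its weight with the number n of assignates is an arbitrary choice.

```python
from abc import ABC, abstractmethod
import math

class Similarity(ABC):
    """Abstract similarity: each subclass is a hypothesis, not 'the' similarity."""

    @abstractmethod
    def __call__(self, a, b):
        """Map two profiles (equal-length number sequences) onto a scalar."""

class DistanceSimilarity(Similarity):
    """Level-oriented part: root mean squared difference, mapped to (0, 1]."""
    def __call__(self, a, b):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
        return 1.0 / (1.0 + dist)

class FormSimilarity(Similarity):
    """Form-oriented part: Pearson correlation, rescaled to [0, 1]."""
    def __call__(self, a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a)
        vb = sum((y - mb) ** 2 for y in b)
        if va == 0 or vb == 0:
            return 0.5  # a flat profile carries no form information
        return 0.5 * (1.0 + cov / math.sqrt(va * vb))

class ProfileSimilarity(Similarity):
    """Compound measure whose form-factor weight grows with the profile length."""
    def __init__(self, scale=10.0):
        self.scale = scale  # assumed tuning parameter, itself a free parameter
        self.distance = DistanceSimilarity()
        self.form = FormSimilarity()

    def form_weight(self, n):
        # Assumed weighting rule: approaches 1 as more assignates are used.
        return n / (n + self.scale)

    def __call__(self, a, b):
        w = self.form_weight(len(a))
        return (1.0 - w) * self.distance(a, b) + w * self.form(a, b)

measure = ProfileSimilarity(scale=10.0)
a = [1.0, 2.0, 3.0, 4.0]
b = [11.0, 12.0, 13.0, 14.0]  # same form, different level
value = measure(a, b)
```

Since every measure is an object behind the same abstract interface, the modeling procedure can exchange it like any other parameter. The abstract class itself cannot be instantiated, which reflects that only particular hypotheses about similarity can ever be calculated.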

From a philosophical perspective there are several interesting issues to mention. It should be clear that our notion of similarity is not following the realist account. Our notion of similarity is not directed towards the relation of “objects” in the “physical world” and “concepts” in the “mental world.” Please excuse the inflationary usage of quotation marks, yet it is not possible otherwise to repel realism in such sentences. Indeed, we think that similarity can’t be applied to concepts at all. Trying to do so [e.g. 17] one would commit a double categorical mistake: First, concepts may arise exclusively as an *embedment* (not: entailment) of symbols, which in turn require similarity as an operational field. It is impossible to apply similarity to concepts without further interpretation. Second, concepts can’t be positively determined and they are best conceived as transcendental choreostemic poles. This categorically excludes the application of the concept of similarity to the concept of concepts. A naturalization by means of (artificial) neuronal structures [e.g. 18] is missing the point even more dramatically.3 “Concept” and “similarity” are mutually co-extensive, “similarly” to space and time.

As always, we think that there is the primacy of interpretation, hence it is useless to talk about a “physical world as it is as-such.” We do not deny that there is a physical outside, of course. Through extensive individual modeling that is not only shared by a large community, but also has to provide some anticipatory utility, we even may achieve insights, i.e. derive concepts, that one could call “similar” with regard to a world. But again, this would require a position outside of the world AND outside of the used concepts and practiced language. Such a position is not available. “Objects” do not “exist” prior to interpretation. “Objecthood” derives from (abstract) substance by adding a lot of particular, often “structural,” assignates within the process of modeling, most salient by imposing purpose and the respective instance of the concept of similarity.

There are two conclusions from that. First, similarity is a purely operational concept, it does not imply any kind of relation to an “existent” reference. Second, it would be wrong to limit similarity (and modeling, anticipation, comparison, etc.) to external entities like material “objects.” With the exception of pure logic, we always have to interpret. We interpret by using a word in thought, we interpret even if our thoughts are shapeless. Thinking is an open, processual system of cascaded modeling relations. Modeling starts with the material interaction between material aspects of “objects” or bodies, it takes place throughout the perception of external differences, the transduction and translation of internal signals, the establishment of intensions and concepts in our associative networks, up to the ideas of inference and propositional content.

Our investigations ended by describing similarity as a scalar. This dimensionless appearance should not be misunderstood in a representationalist manner, that is, as indication that similarity does not have a structure. Our analysis revealed important structural aspects that relate to many areas in philosophy.

In other chapters we have seen that modeling and comparing are inevitable actions. Due to their transcendental character we even may say that they are inevitable events. As subjects, we can’t evade the necessity of modeling. We can do it in a diagnostic attitude, directed backward in time, or we can do it in a predictive or anticipatory attitude, directed forward in time. Both directions are connected through learning and bridged by Peirce’s sign situation, but any kind of starting point reduces to modeling. If spelled out in a sudden manner it may sound strange that modeling and comparing are deeply inscribed into the event-structure of the world. Yet, there are good reasons to think so.

Against this background, similarity denotes a hot spot for the actualization of intentions. As an element in modeling it is the operation to transport purposes into the world and its perception. Even more concentrated we may call similarity the *carrier of purpose*. For all other elements of modeling besides the purpose and similarity one can refer to “necessities,” such as material constraints, or limitations regarding time and energy. (Obviously, contingent selections are nothing one can speak about in other ways than just by naming them, they are singularities.)

Having said this, it is clear that the relative neglect of similarity against logic should be corrected. Similarity is the (abstract) hatching-ground for symbols, so to say, in Platonic terms, *the sky for the ideas*.

Notes

1. The geometrical approach is largely equal to what today is known as the feature vector approach, which is part of any dimensional mapping. Examples are multi-dimensional scaling, principal component analysis, or self-organizing maps.

2. Category theory provides a formalism that is not closed, since categories can be defined in terms of category theory. This self-referentiality is unique among formal approaches. Examples for closed formalisms are group theory, functional analysis or calculi like λ-calculus.

3. Besides that, Christoph Gauker provided further arguments that concepts could not be conceived as regions of similarity spaces [19].

- [1] McGill, M., Koll, M., and Noreault, T. (1979). An evaluation of factors affecting document ranking by information retrieval systems. Final report for grant NSF-IST-78-10454 to the National Science Foundation, Syracuse University.
- [2] Wesley C. Salmon.
- [3] van Rooij, Robert. 2011c. Vagueness and linguistics. In: G. Ronzitti (ed.), The vagueness handbook, Springer New York, 2011.
- [4] Lakoff
- [5] Haverkamp (ed.), Metaphorologie
- [6] Duden. Das Herkunftswörterbuch. Die Etymologie der Deutschen Sprache. Mannheim 1963.
- [7] Oxford Encyclopedia of Semiotics: Semiotic Terminology. available online, last accessed 29.12.2011.
- [8] Amos Tversky (1977), Features of Similarity. Psychological Review, Vol.84, No.4. available online
- [9] Goldstone. Comparison. Springer, New York 2010.
- [10] Dekang Lin, An Information-Theoretic Definition of Similarity. In: Proceedings of the 15th International Conference on Machine Learning ICML, 1998, pp. 296-304. download
- [11] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996). From Data Mining to Knowledge Discovery in Databases. American Association for Artificial Intelligence, p.37-54.
- [12] Thomas Reinartz, Focusing Solutions for Data Mining: Analytical Studies and Experimental Results in Real-World Domains (LNCS) Springer, Berlin 1999.
- [13] G. Nakhaeizadeh (ed.). Data mining. Physica Weinheim, 1998.
- [14] Alan Hájek (2007), The Reference Class Problem is Your Problem Too. Synthese 156: 185-215. available online.
- [15] Edward K. Cheng (2009), A Practical Solution to the Reference Class Problem. COLUMBIA LAW REVIEW Vol.109:2081-2105. download
- [16] Peirce, Stanford Encyclopedia.
- [17] Tim Schroeder (2007), A Recipe for Concept Similarity. Mind & Language, Vol.22 No.1. pp. 68–91.
- [18] Churchland
- [19] Christoph Gauker (2007), A Critique of the Similarity Space Theory of Concepts. Mind & Language, Vol.22 No.4, pp.317–345.

۞
