Prolegomena to a Morphology of Experience

May 2, 2012

Experience is a fundamental experience.

The very fact of this sentence demonstrates that experience differs from perception, much like phenomena differ from objects. It also demonstrates that there can’t be an analytic treatment, let alone a solution, of the question of experience. Experience is not only related to sensual impressions, but also to affects, activity, attention1 and associations. Above all, experience is deeply linked to the impossibility of knowing anything for sure or, likewise, apriori. This insight is etymologically woven into the word itself: the Greek “peira” means “trial, attempt, experience”, and it also informs the roots of “experiment” and “peril”.

In this essay we will focus on some technical aspects that underlie the capability to experience. Before we go in medias res, I have to make clear the rationale for doing so, since, quite obviously, experience cannot be reduced to those technical aspects, to which for instance modeling belongs. Experience is more than the techné of sorting things out [1] and even more than the techné of the genesis of discernability, but at the same time it plays a particular, if not foundational role in and for the epistemic process, its choreostemic embedding and their social practices.

Epistemic Modeling

As usual, we take the primacy of interpretation as one of the transcendental conditions, that is, a condition we can’t go beyond, even on the “purely” material level. As a suitable operationalization of this principle, still a quite abstract one and hence calling for situative instantiation, we chose the abstract model. In epistemic practice, modeling does not, indeed never could, refer to data that is supposed to “reflect” an external reality. If we perform modeling as a pure technique, we are just modeling; yet creating a model for whatsoever purpose, so to speak “modeling as such”, or purposed modeling, is not sufficient to establish an epistemic act, which would include the choice of the purpose and the choice of the risk attitude. Such a reduction is typical for functionalism, or for positions that claim that epistemic autonomy is computable in principle, as for instance the computational theory of mind does.

Quite in contrast, purposed modeling in epistemic individuals already presupposes the transition from probabilistic impressions to propositional, or at least symbolic, representation. Without performing this transition from potential signals, that is mediated “raw” physical fluctuations in the density of probabilities, to the symbolic, it is impossible to create a structure, be it for instance a feature vector as a set of variably assigned properties, “assignates”, as we called them previously. Such a minimal structure, however, is mandatory for purposed modeling. Any (re)presentation of observations to a modeling method is thus already subsequent to prior interpretational steps.

Our abstract model that serves as an operationalization of the transcendental principle of the primacy of interpretation thus must also provide, or comprise, the transition from differences into proto-symbols. Proto-symbols are not just intensions or classes; they are, so to speak, non-empiric classes that have been derived from empiric ones by means of idealization. Proto-symbols are developed into symbols by means of the combination of naming and an associated practice, i.e. a repeating or reproducible performance, or, in still other words, by rule-following. Only on the level of symbols may we then establish a logic, or claim absolute identity. Here we also meet the reason for the fact that in any real-world context a “pure” logic is not possible, as there are always semantic parts serving as a foundation of its application. Speaking about “truth-values” or “truth-functions” is, at the least, meaningless. Clearly, identity as a logical form is a secondary quality and thus quite irrelevant for the booting of the capability of experience. Such extended modeling is, of course, not just a single instance; it is itself a multi-leveled thing. It even starts with those properties of the material arrangement known as body that also allow an informational perspective. The most prominent candidate principle for such a structure is the probabilistic, associative network.

Epistemic modeling thus consists of at least two abstract layers: first, the associative storage of random contexts (see also the chapter “Context” for their generalization), where no purpose is imposed onto the materially pre-processed signals, and second, the purposed modeling. I am deeply convinced that such a structure is the only way to evade the fallacy of representationalism2. A working actualization of this abstract bi-layer structure may comprise many layers and modules.

Yet, once one accepts the primacy of interpretation, and there is little to say against it, if anything at all, then we are led directly to epistemic modeling as a mandatory constituent of any interpretive relationship to the world, for primitive operations as well as for the rather complex mental life we experience as humans, with regard to our relationships to the environment as well as with regard to our inner reality. Wittgenstein emphasized in his critical solipsism that the conception of reality as inner reality is the only reasonable one [3]. Epistemic modeling is the only way to keep meaningful contact with the external surroundings.

The Bridge

In its technical parts, experience is based on an actualization of epistemic modeling. Later we will investigate the role and the usage of these technical parts in detail. Yet, the gap between modeling, even if conceived as abstract, epistemic modeling, and experience is so large that we first have to shed some light on the bridge between these concepts. There are other issues with experience than just the merely technical issues of modeling, and they are no less relevant for the technical issues, too.

Experience comprises both more active and more passive aspects, both with regard to performance and to structure. These dichotomies must not be taken as ideally separated categories, of course. Besides, the basic distinction into active and passive parts is not a new one either. Kant distinguished receptivity and spontaneity as two complementary faculties that combine in order to bring about what we call cognition. Leibniz, in contrast, emphasized the necessity of activity even in basic perception; nowadays, his view has been largely confirmed by the research on sensing in organic (animals) as well as in inorganic systems (robots). Obviously, the relation between activity and passivity is not a simple one, as soon as we are going to leave the bright spheres of language.3

In the structural perspective, experience unfolds in a given space that we could call the space of experiencibility4. That space is spanned, shaped and structured by open and dynamic collections of any kind of theory, model, concept or symbol, as well as by the mediality that is “embedding” those. Yet, experience also shapes this space itself. The situation is a bit reminiscent of relativistic space in physics, or of the social space in humans, where the embedding of one space into another affects both participants, the embedded as well as the embedding space. These aspects we should keep in mind for our investigation of questions about the mechanisms that contribute to experience and the experience of experience. As you can see, we again refute any kind of ontological stance, even to its smallest degrees.5

Now, when going to ask about experience and its genesis, there are two characteristics of experience that force us to avoid the direct path. First, there is the deep linkage of experience to language. We must get rid of language for our investigation in order to avoid the experience of finding just language behind the language, or behind what we upfront call “experience”; yet, we should not forget about language either. Second, there is the self-referentiality of the concept of experience, which actually renders it into a strongly singular term. Once there are even only tiny traces of the capability for experience, the whole game changes, burying the initial roots and mechanisms that are necessary for the booting of the capability.

Thus, our first move consists in a reduction and linearization, which we have to catch up with later again, of course. We will achieve that by setting everything into motion, so-to-speak. The linearized question thus is heading towards the underlying mechanisms6:

How do we come to believe that there are facts in the world?7

What are—now viewed from the outside of language8—the abstract conditions and the practiced moves necessary and sufficient for the actualization of such statements?

Usually, the answer will refer to some kind of modeling. Modeling provides the possibility for the transition from the extensional epistemic level of particulars to the intensional epistemic level of classes, functions or categories. Yet, modeling does not provide sufficient reason for experience. Sure, modeling is necessary for it, but it is more closely related to perception, though not being equivalent to that either. Experience as a kind of cognition thus can’t be conceived as a kind of “high-level perception”, quite contrary to the suggestion of Douglas Hofstadter [4]. Instead, we may conceive experience, in a first step, as the result of and the activity around the handling of the conditions of modeling.

Even in his earliest writings, Wittgenstein prominently emphasized that it is meaningless to conceive of the world as consisting of “objects”. The Tractatus starts with the proposition:

The world is everything that is the case.

Cases, in the Tractatus, are states of affairs that could be made explicit into a particular (logical) form by means of language. From this perspective one could derive the radical conclusion that without language there is no experience at all. Although we won’t agree to such a thesis, language is a major factor contributing to some often unrecognized puzzles regarding experience. Let us very briefly return to the issue of language.

Language establishes its own space of experiencibility, basically through its unlimited expressibility, which induces hermeneutic relationships. It is probably mainly due to this particular experiential sphere that language blurs or even blocks a clear sight of the basic aspects of experience. Language can make us believe that there are phenomena as some kind of original stuff, existing “independently” out there, that is, outside human cognition.9 Yet, there is no such thing as a phenomenon or even an object that would “be” before experience, and for us humans not even before or outside of language. It is not even reasonable to speak about phenomena or objects as if they existed before experience. De facto, it is almost non-sensical to do so.

Both objects as specified entities and phenomena at large are consequences of interpretation, in turn deeply shaped by cultural imprinting, and thus heavily dependent on language. Refuting that consequence would mean to refute the primacy of interpretation, which would fall into one of the categories of either naive realism or mysticism. Phenomenology as an ontological philosophical discipline is nothing but a mis-understanding (as ontology is henceforth); since phenomenology without ontological parts must turn into some kind of Wittgensteinian philosophy of language, it simply vanishes. Indeed, when already teaching in Cambridge, Wittgenstein once told a friend to report his position to the visiting Schlick, whom he refused to meet on this occasion, as “You could say of my work that it is phenomenology.” [5] Yet, what Wittgenstein called “phenomenology” is completely situated inside language and its practicing, and although there might be a weak Kantian echo in his work, he never supported Husserl’s position of synthetic universals apriori. There is even some likelihood that Wittgenstein, strongly feeling constantly misunderstood by the members of the Vienna Circle, put this forward in order to annoy Schlick (a bit), at least to pay him back in kind.

Quite in contrast, in a Wittgensteinian perspective facts are a sort of collectively compressed beliefs about relations. If everybody believes in a certain model, of whatever reference and of almost arbitrary expectability, then there is a fact. This does not mean, however, that we get drowned by relativism. There are still the constraints implied by the (unmeasured and unmeasurable) utility of anticipation, both in its individual and its collective flavor. On the other hand, yes, this indeed means that the (social) future is not determined.

More accurately, there is at least one fact, since the primacy of interpretation generates at least the collectivity as a further fact. Since facts are taking place in language, they do not just “consist” of content (please excuse such awful wording), there is also a pragmatics, and hence there are also at least two different grammars, etc.etc.

How do we, then, individually construct concepts that we share as facts? Even if we need the mediation of a collective, a large deal of the associative work takes place in our minds. Facts are identifiable, thus distinguishable and enumerable. Facts are almost digitized entities; they are constructed from percepts through a process of intensionalization or even idealization, and they sit on the verge of the realm of symbols.

Facts are facts because they are considered as being valid, be it among a collective of people, across some period of time, or across a range of material conditions. This way they turn into a kind of apriori from the perspective of the individual, and there is only that perspective. Here we find the locus situs of several related misunderstandings, such as direct realism, Husserlean phenomenology, positivism, the thing as such, and so on. The fact is even synthetic, either by means of “individual”10 mental processes or by the working of a “collective reasoning”. But, of course, it is by no means universal, as Kant concluded on the basis of Newtonian science, or even as Schlick did in 1930 [6]. There is neither a universal real fact, nor a particular one. It does not make sense to conceive of the world as consisting of independent objects.

As a consequence, when speaking about facts we usually studiously avoid the question of risk. Participants in the “fact game” implicitly agree on abandoning the negotiation of affairs of risk. Despite the fact that empirical knowledge never can be considered as being “safe” or “secured”, during the fact game we always behave as if it could. Doing so is the more or less hidden work of language, which removes the risk (associated with predictive modeling) and replaces it by metaphorical expressibility. Interestingly, here we also meet the source field of logic. It is obvious (see Waves & Words) that language is neither an extension of logic, nor is it reasonable to consider it as a vehicle for logic, i.e. for predicates. Quite to the contrary, the underlying hypothesis is that (practicing) language and (weaving) metaphors are the same thing.11 Such a language becomes a living language that (as Gier writes [5])

“[…] grows up as a natural extension of primitive behavior, and we can count on it most of the time, not for the univocal meanings that philosophers demand, but for ordinary certainty and communication.”

One might just modify Gier’s statement a bit by specifying “philosophers” as idealistic, materialistic or analytic philosophers.

In “On Certainty” (OC, §359), Wittgenstein speaks of language as expressing primitive behavior and contends that ordinary certainty is “something animal”. This we may now take as a bridge that allows us to extend our inquiry about concepts and facts towards the investigation of the role of models.

Related to this, there is a pragmatist aspect worth mentioning. Experience is a historicizing concept, much like knowledge. Both concepts are meaningful only in hindsight. As soon as we consider their application, we see that both of them refer only to one half of the story that is about the epistemic aspects of “life”. The other half of the epistemic story, directly implied by the inevitable need to anticipate, is predictive or, equivalently, diagnostic modeling. Abstract modeling in turn implies theory, interpretation and orthoregulated rule-following.

Epistemology thus should not be limited to “knowledge”, the knowable and its conditions. Epistemology explicitly has to include the investigation of the conditions of what can be anticipated.

In a still different way we thus may re-pose the question about experience as the transition from epistemic abstract modeling to the conditions of that modeling. This would include the instantiation of practicable models as well as the conditions for that instantiation, and also the conditions of the application of models. In technical terms this transition is represented by a problematic field: the model selection problem, or in more pragmatic terms, the model (selection) risk.

These two issues, the prediction task and the conditions of modeling, now form the second toehold of our bridge between the general concept of experience and some technical aspects of the use of models. There is another bridge necessary to establish the possibility of experience, one that connects the concept of experience with languagability.

The following provides an overview of the chapters to come: The Modeling Statement; Predictability and Predictivity; The Independence Assumption; The Model Selection Problem (methods, models and variables; the perils of universalism; optimization; genetics, revisited; noise; résumé); Describing Classifiers; Utilization of Information; and Observations and Probabilities.

These topics are closely related to each other, indeed so closely that other sequences would be justifiable too. Their interdependencies also demand a bit of patience from you, the reader, as the picture will be complete only when we arrive at the results of modeling.

A last remark may be allowed before we start to delve into these topics. It should be clear by now that any kind of phenomenology is deeply incompatible with the view developed here. There are several related stances, e.g. the various shades of ontology, including the objectivist conception of substance. They are all rendered as irrelevant and inappropriate for any theory about episteme, whether in its machine-based form or regarding human culture, whether as practice or as reflecting exercise.

The Modeling Statement

As the very first step we have to clearly state the goal of modeling. From the outside that goal is pretty clear: given a set of observations and the respective outcomes, or targets, create a mapping function such that the observed data allow for a reconstruction of the outcome in an optimized manner. Finding such a function can be considered a simple form of learning if the function is “invented”. In most cases it is not learning but just the estimation of pre-defined parameters.12 In a more general manner we also could say that any learning algorithm is a map L from data sets to a ranked list of hypothesis functions. Note that accuracy is only one of the possible aspects of that optimization. Let us call this, for convenience, the “outer goal” of modeling. Were such a mapping perfect within reasonable boundaries, we would automatically have found a possible transition from probabilistic presentation to propositional representation. We could consider the induction of a structural description from observations as completed. So much for the secret dream of Hans Reichenbach, Carl Hempel, Wesley Salmon and many of their colleagues.

The said mapping function will never be perfect. The reasons for this comprise the complexity of the subject, noise in the measured data, unsuitable observables, or any combination of these. This induces a wealth of necessary steps and, of course, a lot of work. In other words, a considerable amount of apriori and heuristic choices have to be made. Since a reliable, say analytic, mapping can’t be found, every single step in the value chain towards the model at once becomes questionable and has to be checked for its suitability and reliability. It is also clear that the model does not comprise just a formula. In real-world situations a differential modeling should be performed, much like in medicine a diagnosis is considered complete only if a differential diagnosis is included. This comprises the investigation of the influence of the method’s parameterization on the results. Let us call the whole bunch of respective goals the “inner goals” of modeling.

So, being faced with the challenge of such an empirical mess, what does the statement about the goals of this “inner modeling” look like? We could for instance demand to remove the effects of the shortfalls mentioned above, which cause the imperfect mapping: complexity of the subject, noise in the measured data, or unsuitable observables.

To make this more concrete we could say that the inner goals of modeling consist in a two-fold (and thus synchronous!) segmentation of the data, resulting in the selection of the proper variables and in the selection of the proper records, where this segmentation is performed under the condition of a preceding non-linear transformation of the embedding reference system. Ideally, the model identifies the data for which it is applicable. Only for those data is a classification then provided. It is pretty clear that this statement is an ambitious one. Yet, we regard it as crucial for any attempt to step across our epistemic bridge that brings us from particular data to the quality of experience. This transition includes something that is probably better known by the label “induction”. Thus, we finally arrive at a short statement about the inner goals of modeling:

How to conclude and what to conclude from measured data?

Obviously, if our data are noisy and include irrelevant values, any further conclusion will be unreliable. Yet, for any suitable segmentation of the data we need a model first. From this it directly follows that a suitable procedure for modeling can’t consist of just a single algorithm, or a “one-shot procedure”. Any single-step approach suffers from lots of hidden assumptions that influence the results and their properties in unforeseeable ways. Modeling that could be regarded as more than just an estimation of parameters by running an algorithm is necessarily a circular and—dependent on the number of variables—possibly open-ended process.

Predictability and Predictivity

Let us assume a set of observations S obtained from an empirical process P. Then this process P should be called “predictable” if the results of the mapping function f(m), which serves as an instance of a hypothesis h from the space of hypotheses H, coincide with the outcomes of the process P in such a way that f(m) forms an expectation with a deviation d<ε across all observations. In this case we may say that f(m) predicts P. This deviation is also called the “empirical risk”, and the purpose of modeling is often regarded as minimizing the empirical risk (ERM).
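Written out as a formula (a sketch only; the loss function L and the sample notation are my additions, not the author’s):

```latex
R_{\mathrm{emp}}(f_m) \;=\; \frac{1}{n}\sum_{i=1}^{n} L\bigl(f_m(x_i),\, y_i\bigr),
\qquad
P \text{ counts as predicted by } f_m \text{ iff } R_{\mathrm{emp}}(f_m) < \varepsilon .
```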

There are then two important questions. Firstly, can we trust f(m), since f(m) has been built on a limited number of observations? Secondly, how can we make f(m) more trustworthy, given the limitation regarding the data? Usually, these questions are handled under the label of validation. Yet, validation procedures are not the only possible means to get an answer here. It would be a misunderstanding to think that it is the building or construction of a model that is problematic.

The first question can be answered only by considering different models. For obtaining a set of different models we could apply different methods. That would be fine if prediction were our sole interest. Yet, we also strive for structural insights, and from that perspective we should not, of course, use different methods to get different models. The second possibility for addressing the first question is to use different sub-samples, which turns simple validation into a cross-validation. Cross-validation provides an expectation for the error (or the risk). Yet, in order to compare across methods one actually should describe the expected decrease in “predictive power”13 for different sample sizes (independent cross-validation per sample size). The third possibility for answering question (1) is related to the former and consists in adding noised, surrogated (or simulated) data. This prevents the learning mechanism from responding to empirically consistent, but nevertheless irrelevant noisy fluctuations in the raw data set. The fourth possibility is to look for models of equivalent predictive power which are, however, based on a different set of predicting variables. This possibility is not accessible for most statistical approaches, such as Principal Component Analysis (PCA). Whatever method is used to create different models, models may be combined into a “bag” of models (called “bagging”), or, following an even more radical approach, into an ensemble of small and simple models. This is employed for instance in the so-called Random Forest method.
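A minimal sketch of the second possibility, i.e. describing the expected decrease of predictive power via independent cross-validation per sample size (the data, the choice of a Random Forest and all constants are illustrative assumptions, not taken from the text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for a real data set
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)

rng = np.random.default_rng(0)
for n in (200, 500, 1000, 2000):   # independent cross-validation per sample size
    idx = rng.choice(len(y), size=n, replace=False)
    scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                             X[idx], y[idx], cv=5)
    print(f"n={n:5d}  expected predictive power ~ {scores.mean():.3f} +/- {scores.std():.3f}")
```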

Commonly, if a model passes cross-validation successfully, it is considered to be able to “generalize”. In contrast to the common practice, Poggio et al. [7] demonstrated that standard cross-validation has to be extended in order to provide a characterization of the capability of a model to generalize. They propose to augment

CVloo stability with stability of the expected error and stability of the empirical error to define a new notion of stability, CVEEEloo stability.

This makes clear that the approach of Poggio et al. addresses the learning machinery, no longer just the space of hypotheses. Yet, they do not take the free parameters of the method into account. We conclude that their proposed approach still remains uncritical. Thus I would consider such a model as not completely trustworthy. Of course, Poggio et al. are definitely pointing in the right direction. We recognize a move away from naive realism and positivism, towards a critical methodology of the conditional. Maybe philosophy and the natural sciences will find common ground again by riding the information tiger.

Checking the stability of the learning procedure leads to a methodology that we called “data experiments” elsewhere. Data experiments do NOT explore the space of hypotheses, at least not directly. Instead they create a map of all possible models. In other words, instead of just asking about predictability we now ask about the differential predictivity in the space of models.

From the perspective of a learning theory, the importance of Poggio’s move can hardly be overestimated. Statistical learning theory (SLT) [8] explicitly assumes that a direct access to the world is possible (via an identity function, i.e. the perfectness of the model). Consequently, SLT focuses (only) on the reduction of the empirical risk. Any learning mechanism following the SLT is hence uncritical about its own limitations. SLT is interested in the predictability of the system-as-such, thereby, not very surprisingly, committing the mistake of pre-19th-century idealism.

The Independence Assumption

The independence assumption [I.A.], or linearity assumption, acts mainly on three different targets. The first of them is the relationship between observer and observed, the second is the relationship between observables, and the third regards the relation between individual observations. This last aspect of the I.A. is the least problematic one. We will not discuss it any further.

Yet, the first and the second are problematic. The I.A. is deeply buried in the framework of statistics, and from there it made its way into the field of explorative data analysis. There it is frequently met, for instance in the geometrical operationalization of similarity, in the conceptualization of observables as Cartesian dimensions or as independent coefficients in systems of linear equations, or as statistical kernels in algorithms like the Support Vector Machine.

Of course, the I.A. is just one possible stance towards the treatment of observables. Yet, taking it as an assumption, we will not include any parameter into the model that reflects the dependency between observables. Hence, we will never detect the most suitable hypothesis about the dependency between observables. Instead of assuming the independence of variables throughout an analysis, it would be methodologically much more sound to address the degree of dependency as a target. Linearity should not be an assumption, it should be a result of an analysis.
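A small sketch of what it means to treat dependency as a result rather than an assumption (the data and the choice of Spearman’s rank correlation as the dependency measure are illustrative assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = 0.6 * a + 0.4 * rng.normal(size=500)   # b depends on a
c = rng.normal(size=500)                   # c is independent of both

X = np.column_stack([a, b, c])
rho, pval = spearmanr(X)                   # full dependency matrix, no linearity imposed
print(np.round(rho, 2))                    # the degree of dependency is an *output*
```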

The linearity or independence assumption carries another assumption under its hood: the assumption of the homogeneity of variables. Variables, or assignates, are conceived as black boxes, with unknown influence on the predictive power of the model. Yet, usually they exert very different effects on the predictive power of a model.

Basically, it is very simple. The predictive power of a model depends on both the positive predictive value AND the negative predictive value, of course; we may also use the closely related terms sensitivity and specificity. Accordingly, some variables contribute more to the positive predictive value, others help to increase the negative predictive value. This easily becomes visible if we perform a detailed type-I/II error analysis. Thus, there is NO way to avoid testing those combinations explicitly, even if we assume the initial independence of variables.

As we already mentioned above, the I.A. is just one possible stance towards the treatment of observables. Yet, its status as a methodological sine qua non that additionally is never reflected upon renders it into a metaphysical assumption. It is in fact an irrational assumption, which induces serious costs in terms of the structural richness of the results. Taken together, the independence assumption represents one of the most harmful habits in data analysis.

The Model Selection Problem

In the section “Predictability and Predictivity” above we already emphasized the importance of the switch from the space of hypotheses to the space of models. The model space unfolds as a function of the available assignates, the size of the data set and the free parameters of the associative (“modeling”) method. The model space supports a fundamental change of attitude towards a model. Based on the denial of the apriori assumption of the independence of observables, we identified the idea of a singular best model as an ill-posed phantasm. We thus move on from the concept of a model as a mapping function towards ensembles of structurally heterogeneous models that together, as a distinguished population, form a habitat, a manifold in the sphere of the model space. With such a structure we no longer need to arrive at a single model.

Methods, Models, Variables

The model selection problem addresses two sets of parameters that are actually quite different from each other. Model selection should not be reduced to the treatment of the first set, of course, as happens at least implicitly for instance in [9]. The first set refers to the variables as known from the data, sometimes also called the “predictors”. The selection of the suitable variables is the first half of the model selection problem. The second set comprises all free parameters of the method. From the methodological point of view, this second set is much more interesting than the first one. The method’s parameters are apriori conditions of the performance of the method, which additionally usually remain invisible in the results, in contrast to the selection of variables.

For associative methods like the SOM or other clustering methods, the effect of de-/selecting variables can be easily described, as sketched below. Just take all the objects in front of you, for instance on the table, or in your room. Now select an arbitrary purpose and assign this purpose as a degree of support to those objects. For now, we have constructed the target. Then we go “into” the objects, that is, we describe them by a range of attributes that are present in most of the objects. Dependent on the selection of a subset from these attributes we will arrive at very different groups. The groups represent the target more or less well; that is the quality of the model. Obviously, this quality differs across the various selections of attributes. It is also clear that it does not help to just use all attributes, because some of the attributes simply destroy the intended order; they add noise to the model and decrease its quality.
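The following toy sketch mimics that thought experiment numerically; KMeans merely stands in for any associative grouping (a SOM would be the natural choice), and the data, subsets and scores are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import adjusted_rand_score

# objects described by 8 attributes, plus a "purpose" (target) assigned to them
X, target = make_classification(n_samples=600, n_features=8, n_informative=3,
                                n_redundant=0, random_state=2)

for subset in ([0, 1, 2], [0, 3, 5], [2, 4, 6], list(range(8))):
    groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[:, subset])
    quality = adjusted_rand_score(target, groups)  # how well the groups represent the target
    print(subset, round(quality, 3))               # quality differs strongly across selections
```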

As George observes [10], since its first formulation in the 1960s a considerable, if not large, number of proposals for dealing with the variable selection problem have been put forward. Although George himself seems to distinguish the two sets of parameters, throughout the discussion of the different approaches he always refers just to the first set, the variables as included in the data. This is not a failure of the said author, but a problem of the statistical approach. Usually, the parameters of statistical procedures are not accessible; like any analytic procedure, they work as they work. In contrast to Self-Organizing Maps, and even to Artificial Neural Networks (ANN) or genetic procedures, analytic procedures can’t be modified in order to achieve a critical usage. In some way, with their mono-bloc design they fit perfectly into the representationalist fallacy.

Thus, using statistical (or other analytic) procedures, the model selection problem consists of the variable selection problem and the method selection problem. The consequences are catastrophic: if statistical methods are used in the context of modeling, the whole statistical framework turns into a black box, because the selection of a particular method can’t be justified in any respect. In contrast to that quite unfavorable situation, methods like the Self-Organizing Map provide access to any of their parameters. Data experiments are only possible with methods like the SOM or ANN. It is not the SOM or the ANN that are “black boxes”; rather, the statistical framework must be regarded as such. Precisely this is also the reason for the still ongoing quarrels about the foundations of the statistical framework. There are two parties, the frequentists and the Bayesians. Yet, both are struck by the reference class problem [11]. From our perspective, the current dogma of empirical work in science needs to be changed.

The conclusion is that statistical methods should not be used at all to describe real-world data, i.e. for the modeling of real-world processes. They are suitable only within a fully controlled setting, that is, within a data experiment. The first step in any kind of empirical analysis thus must consist of a predictive modeling that includes the model selection task.14

The Perils of Universalism

Many people dealing with the model selection task are misled by a further irrational phantasm, caused by a mixture of idealism and positivism. This is the phantasm of the single best model for a given purpose.

Philosophers of science recognized long ago, starting with Hume and ultimately expressed by Quine, that empirical observations are underdetermined. The actual challenge posed by modeling is given by this fact of empirical underdetermination. Goodman felt obliged to construct a paradox from it. Yet, there is no paradox, there is only the phantasm of the single best model. This phantasm is a relic from the Newtonian period of science, where everybody thought the world was made by God as a miraculous machine, everything had to be well-defined, and persisting contradictions had to be rated as evil.

Secondarily, this moults into the affair of (semantic) indetermination. Plainly spoken, there are never enough data. Empirical underdetermination results in the actuality of strongly diverging models, which in turn gives rise to conflicting experiences. For a given set of data, in most cases it is possible to build very different models (ceteris paribus, by choosing different sets of variables) that yield the same utility, or say predictive power, as far as this predictive power can be determined from the available data sample at all. Such a ceteris paribus difference will not only give rise to quite different tracks of unfolding interpretation; it is also certainly in the close vicinity of Derrida’s deconstruction.

Empirical underdetermination thus results in a second-order risk, the model selection risk. Actually, the model selection risk is the only relevant risk. We can’t change the available data, and data are always limited, sometimes just by their puniness, sometimes by the restrictions on dealing with them. Risk is not attached to objects or phenomena, because objects “are not there” before interpretation and modeling. Risk is attached only to models. Risk is a particular state of affairs, and indeed a rather fundamental one. Once a particular model tells us that there is an uncertainty regarding the outcome, we can take measures to deal with that uncertainty. For instance, we hedge it, or organize some other kind of insurance for it. But hedging has to rely on the estimation of the uncertainty, which depends on the expected predictive power of the model, not just the accuracy of the model given the available data from a limited sample.

Different, but equivalent selections of variables can be used to create a group of models as “experts” on a given task to be decided. Yet, the selection of such “experts” is not determinable on the basis of the given data alone. Instead, further knowledge about the relation of the variables to further contexts or targets needs to be consulted.

Universalism is usually unjustifiable, and claiming it nevertheless usually comes at huge costs, caused by undetectable blindnesses once we accept it. In contemporary empiricism, universalism—and the respective blindness—is abundant also with regard to the role of the variables. What I am talking about here is context, mediality and individuality, which, from a more traditional formal perspective, is often approximated by conditionality. Yet, it becomes more and more clear that the Bayesian mechanisms are not sufficient to cover the complexity of the concept of variables. Just to mention the current developments in the field of probability theory, I would like to refer to Brian Weatherson, who favors and develops so-called dynamic Keynesian models of uncertainty [10]. Yet, we regard this only as a transitional theory, despite the fact that it will have a strong impact on the way scientists will handle empirical data.

The mediating individuality of observables (as deliberately chosen assignates, of course) is easy to observe once we drop the universalism qua independence of variables. Concerning variables, universalism manifests in an indistinguishability of the choices made to establish the assignates with regard to their effect on the system of preferences. Some criterion C will induce the putative objects as distinguished ones only if another assignate A has pre-sorted them. Yet, it would be a simplification to consider the situation in the Bayesian way as P(C|A). The problem with it is that we can’t say anything about the condition itself. Yet, we need to “play” with (actually not “control”) the conditionability, the inner structure of these conditions. As with the “relation”, which we already generalized into randolations, making it thereby measurable, we also have to go into the condition itself in order to defeat idealism even on the structural level. An appropriate perspective onto variables would hence treat them as a kind of media. This mediality is not externalizable, though, since observables themselves precipitate from the mediality, then as assignates.

What we can experience here is nothing else than the first advents of a real post-modernist world, an era where we emancipate from the compulsive apriori of independence (this does not deny, of course, its important role in the modernist era since Descartes).

Optimization

Optimizing a model means selecting a combination of suitably valued parameters such that the preferences of the users in terms of risk and implied costs are served best. The model selection problem is thus the link between optimization problems, learning tasks and predictive modeling. There are indeed countless procedures for optimization. Yet, the optimization task in the context of model selection is faced with a particular challenge: its mere size. George begins his article in the following way:

A distinguishing feature of variable selection problems is their enormous size. Even with moderate values of p, computing characteristics for all 2^p models is prohibitively expensive and some reduction of the model space is needed.

Assume for instance a data set that comprises 50 variables. From that, 1.13e15 models are possible; assume further that we could test 10‘000 models per second, then we still would need more than 3‘500 years to check all models. Usually, however, building a classifier on a real-world problem takes more than 10 seconds, which would result in roughly 3.5e8 years in the case of 50 variables. And there are many instances where one is faced with many more variables, typically 100+, and sometimes going even into the thousands. That’s what George means by “prohibitively”.
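For the record, the arithmetic behind these figures (a throwaway check in plain Python):

```python
p = 50
n_models = 2 ** p                      # all ceteris paribus variable selections
sec_per_year = 365.25 * 24 * 3600

print(f"{n_models:.3e} models")                    # ~1.126e+15
print(n_models / 10_000 / sec_per_year, "years")   # ~3.6e3 years at 10'000 models/s
print(n_models * 10 / sec_per_year, "years")       # ~3.6e8 years at 10 s per model
```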

There are many proposals to deal with that challenge. All of them fall into three classes: they either (1) use some information-theoretic measure (AIC, BIC, CIC etc. [11]), or (2) use likelihood estimators, i.e. they conceive of the parameters themselves as random variables, or (3) they are based on probabilistic measures established upon validation procedures. Particularly the instances from the first two of those classes are hit by the linearity and/or the independence assumption, and also by unjustified universalism. Of course, linearity should not be an assumption, it should be a result, as we argued above. Hence, there is no way to avoid the explicit calculation of models.

Given the vast number of combinations of symbols, it appears straightforward to conceive of the model selection problem from an evolutionary perspective. Evolution always creates appropriate and suitable solutions from the available “evolutionary model space”. That space is of size 2^30‘000 in the case of humans, which is a “much” larger number than the number of species ever existent on this planet. Not a single viable configuration could have been found by pure chance. Genetics-based alignment and navigation through the model space is much more effective than chance. Hence, the so-called genetic algorithms might appear on the radar as the method of choice.

Genetics, revisited

Unfortunately, for the variable selection problem genetic algorithms15 are not suitable. The main reason for this is still the expensive calculation of single models. In order to set up the genetic procedure, one needs at least 500 instances to form the initial population, whereas any solution for the variable selection problem should arrive at a useful result with fewer than 200 explicitly calculated models. The great advantage of genetic algorithms is their capability to deal with solution spaces that contain local extrema. They can handle even solution spaces that are inhomogeneously rugged, simply for the reason that recombination in the realm of the symbolic does not care about numerical gradients and criteria. Genetic procedures are based on combinations of symbolic encodings. The continuous switch between the symbolic (encoding) and the numerical (effect) is nothing else than the precursor of the separation between genotypes and phenotypes, without which there would not be even simple forms of biological life.

For that reason we developed a specialized instantiation of the evolutionary approach (implemented in SomFluid). Described very briefly we can say that we use evolutionary weights as efficient estimators of the maximum likelihood of parameters. The estimates are derived from explicitly calculated models that vary (mostly, but not necessarily ceteris paribus) with respect to the used variables. As such estimates, they influence the further course of the exploration of the model space in a probabilistic manner. From the perspective of the evolutionary process, these estimates represent the contribution of the respective parameter to the overall fitness of the model. They also form a kind of long-term memory within the process, something like a probabilistic genome. The short-term memory in this evolutionary process is represented by the intensional profiles of the nodes in the SOM.

For the first initializing step, the evolutionary estimates can themselves be estimated by linear procedures like PCA, or by non-parametric procedures (Kruskal-Wallis, Mann-Whitney, etc.), and are available after only a few explicitly calculated models (model here meaning “ceteris paribus selection of variables”).

These evolutionary weights reflect the changes of the predictive power of the model when adding variables to or removing variables from the model. If the quality of the model improves, the evolutionary weight increases a bit, and vice versa. In other words, not the apriori parameters of the model are considered, but just the effect of the parameters. The procedure is an approximating repetition: fix the parameters of the model (method-specific settings, sampling, variables), calculate the model, record the change of the predictive power as compared to the previous model.
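A very small sketch of this idea, reduced to its bare loop (data, model, constants and the exact update rule are illustrative assumptions; the actual SomFluid procedure is SOM-based and considerably richer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=12, n_informative=4,
                           random_state=3)
p = X.shape[1]
weights = np.full(p, 0.5)      # the "probabilistic genome": one weight per variable
rng = np.random.default_rng(3)
best = 0.0

for step in range(60):         # far fewer explicit models than 2**p
    probs = weights / weights.sum()
    subset = rng.choice(p, size=4, replace=False, p=probs)
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, subset], y, cv=5).mean()
    # nudge the weights of the tried variables according to the observed effect
    weights[subset] = np.clip(weights[subset] + 0.1 * np.sign(score - best), 0.05, 1.0)
    best = max(best, score)

print(np.round(weights, 2))    # expectable contribution of each variable
```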

Upon the probabilistic genome of evolutionary weights there are many different ways one could take to implement the “evo-devo” mechanisms, let it be the issue of how to handle the population (e.g. mixing genomes, aspects of virtual ecology, etc.), or the translational mechanisms, so to speak the “physiologies” that are used to proceed from the genome to an actual phenotype.

Since many different combinations are being calculated, the evolutionary weight represents the expectable contribution of a variable to the predictive power of the model, under whatsoever selection of variables that represents a model. Usually, a variable will not improve the quality of the model irrespective of the context. Yet, if a variable indeed did so, we not only would say that its evolutionary weight equals 1, we also may conclude that this variable is a so-called confounder. Including a confounder into a model means that we use information about the target which will not be available when applying the model for the classification of new data; hence the model will fail disastrously. Usually it is not possible for a procedure to identify confounders by itself; that it becomes possible here is just a further benefit of dropping the independence-universalism assumption. It is also clear that the capability to do so is one of the cornerstones of autonomous learning, which includes the capability to set up the learning task.

Noise, and Noise

Optimization raises its own follow-up problems, of course. The most salient of these is so-called overfitting. This means that the model gets suitably fitted to the available observations by including a large number of parameters and variables, but it will return wrong predictions if it is used on data that are even only slightly different from the observations used for learning and estimating the parameters of the model. The model represents noise, random variations without predictive value.

As we have described above, Poggio believes that his criterion of stability overcomes the defects with regard to the model as a generalization from observations. Poggio might be too optimistic, though, since his method still remains confined to the available observations.

In this situation, we apply a methodological trick. The trick consists in turning the problem into a target of investigation, which ultimately translates the problem into an appropriate rule. In this sense, we consider noise not as a problem, but as a tool.

Technically, we destroy the relevance of the differences between the observations by adding noise of a particular characteristic. If we add a small amount of normally distributed noise, probably nothing will change, but if we add a lot of noise, perhaps even of secondarily changing distribution, this will result in the mere impossibility of creating a stable model at all. The scientific approach is to describe the dependency between those two unknowns, so to say, to set up a differential between the noise (a model for the unknown) and the model (of the unknown). The rest is straightforward: creating various data sets that have been changed by imposing different amounts of noise of a known structure, and plotting the predictive power against the amount of noise. This technique can be combined with surrogating the actual observations via a Cholesky decomposition.
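A compact sketch of such a noise experiment (data, model and noise levels are illustrative assumptions; the Cholesky-based surrogates mentioned above are left out for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           random_state=4)
rng = np.random.default_rng(4)
scale = X.std(axis=0)

for noise in (0.0, 0.25, 0.5, 1.0, 2.0, 4.0):
    Xn = X + rng.normal(scale=noise * scale, size=X.shape)  # known, imposed noise
    power = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                            Xn, y, cv=5).mean()
    print(f"noise={noise:4.2f}  predictive power ~ {power:.3f}")
```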

From all available models, those are then preferred that combine a suitable predictive power with a suitable degree of stability against noise.

Résumé

In this section we have dealt with the problem of selecting a suitable subset from all available observables (neglecting for the time being that model selection involves the method’s parameters, too). Since we mostly have more observables at our disposal than we actually presume to need, the task could be simply described as simplification, aka Occam’s Razor. Yet, it would be terribly naive to first assume linearity and then select the “most parsimonious” model. It is even cruel to state [9, p.1]:

It is said that Einstein once said

Make things as simple as possible, but not simpler.

I hope that I succeeded in providing some valuable hints for accomplishing that task, which above all is not a quite simple one. (etc.etc. :)

Describing Classifiers

The gold standard for describing classifiers is believed to be the Receiver Operating Characteristic, or ROC for short. In particular, the area under the curve is compared across models (classifiers). The following Figure 1 demonstrates the mechanics of the ROC plot.

Figure 1: Basic characteristics of the ROC curve (reproduced from Wikipedia)

Figure 2. Realistic ROC curves, though these are typical for approaches that are NOT based on sub-group structures or ensembles (for instance ANN or logistic regression). Note that models should not be selected on the basis of the area under the curve. Instead, the true positive rate (sensitivity) at a false positive rate FPR=0 should be used for that. As a further criterion, indicating the stability of the model, one could use the slope of the curve at FPR=0.
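As a small illustration of that selection criterion (labels, scores and the way the slope is approximated are all illustrative assumptions, not the author’s procedure):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=2000)
scores = y_true * 1.2 + rng.normal(size=2000)   # some mediocre classifier output

fpr, tpr, _ = roc_curve(y_true, scores)
tpr_at_fpr0 = tpr[fpr == 0].max()               # sensitivity at FPR = 0
i = np.searchsorted(fpr, 0, side="right")       # first point with FPR > 0
slope = (tpr[i] - tpr_at_fpr0) / fpr[i]         # crude slope of the curve near FPR = 0
print(round(tpr_at_fpr0, 3), round(slope, 1))
```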

Utilization of Information

There is still another harmful aspect of the universalistic stance in data analysis as compared to a pragmatic stance. This aspect concerns the “reach” of the models we are going to build.

Let us assume that we would accept a sensitivity of approx. 80%, but we also expect a specificity of >99%. In other words, the costs for false positives (FP) are defined as very high, while the costs for false negatives (FN, not recognized preferred outcomes) are relatively low. The ratio of error costs, or in short the error cost ratio err(FP)/err(FN), is high.

Table 1a: A Confusion matrix for a quite performant classifier.

Symbols: test=model; TP=true positives; FP=false positives; FN=false negatives; TN=true negatives; ppv=positive predictive value, npv=negative predictive value. FP is also called type-I error (analogous to “rejecting the null hypothesis when it is true”), while FN is called type-II error (analogous to “accepting the null hypothesis when it is false”); FN/(TP+FN) is called the type-II error rate, sometimes labeled as β-error, where (1-β) is called the “power” of the test or model. (download XLS example)

                 condition Pos    condition Neg
test Pos         100 (TP)         3 (FP)            0.971   ppv
test Neg         28 (FN)          1120 (TN)         0.976   npv
                 0.781            0.997
                 sensitivity      specificity
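The four figures on the margins of Table 1a follow directly from the cell counts (plain Python, standard definitions only):

```python
TP, FP, FN, TN = 100, 3, 28, 1120

ppv         = TP / (TP + FP)   # 0.971  positive predictive value
npv         = TN / (TN + FN)   # 0.976  negative predictive value
sensitivity = TP / (TP + FN)   # 0.781  true positive rate
specificity = TN / (TN + FP)   # 0.997  true negative rate
print(ppv, npv, sensitivity, specificity)
```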

Let us further assume that there are observations of our preferred outcome that we can‘t distinguish well from other cases, namely cases of the opposite outcome that we try to avoid. They are too similar, and due to that similarity they form a separate group in our self-organizing map. Let us assume that the specificity of this cluster is at 86% only while its sensitivity is at 94%.

Table 1b: Confusion matrix describing a sub-group formed inside the SOM, for instance as it could be derived from the extension of a “node”.

                 condition Pos    condition Neg
test Pos         0 (50)           0 (39)            0.0 (0.56)   ppv
test Neg         50 (0)           39 (0)            0.44 (1.0)   npv
                 0.0 (1.0)        1.0 (0.0)
                 sensitivity      specificity

Yet, this cluster would not satisfy our risk attitude. If we used the SOM as a model for the classification of new observations, and a new observation fell into that group (by means of similarity considerations), the implied risk would violate our attitude. Hence, we have to exclude such clusters. In the ROC, this cluster represents a value further to the right on the X-axis, i.e. towards lower specificity.

Note that in the case of acceptance of the subgroup as a contributor to positive predictions, the false negatives are always 0 aposteriori, while in the case of denial the true positives are always set to 0 (and accordingly for the figures of the negative condition).

There are now several important points to this, all related to each other. Actually, we should be interested only in such sub-groups whose specificity is close to 1, such that our risk attitude is well served. [13] Likewise, we should not try to optimize the quality of the model across the whole range of the ROC, but only for the subgroups with an acceptable error cost ratio. In other words, we use the available information in a very specific manner.

As a consequence, we have to set the ECR before calculating the model. Setting the ECR after the selection of a model results in a waste of information, time and money. For this reason it is strongly indicated to use methods that are based on building a representation by sub-groups. This again rules out statistical methods, as they always take into account all available data; Zytkow calls such methods empirically empty [14].
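One possible, very reduced translation of this policy into code (the decision-theoretic threshold ppv ≥ ECR/(ECR+1) and the node counts are illustrative assumptions, not the author’s exact procedure):

```python
# fix the error cost ratio ECR = err(FP)/err(FN) *before* modeling and admit
# only those sub-groups (e.g. SOM nodes) that satisfy the implied ppv threshold
ecr = 20.0                              # a false positive costs 20x a false negative
min_ppv = ecr / (ecr + 1.0)             # ~0.952

nodes = [                               # local (TP, FP) counts per sub-group, invented
    {"TP": 100, "FP": 3},               # ppv ~ 0.971 -> accepted
    {"TP": 50,  "FP": 39},              # ppv ~ 0.56  -> rejected (the Table 1b case)
    {"TP": 30,  "FP": 0},               # ppv = 1.0   -> accepted
]

accepted = [n for n in nodes if n["TP"] / (n["TP"] + n["FP"]) >= min_ppv]
TP = sum(n["TP"] for n in accepted)
FP = sum(n["FP"] for n in accepted)
print(f"accepted {len(accepted)}/{len(nodes)} nodes, pooled TP={TP}, FP={FP}")
```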

The possibility to build models of a high specificity is a huge benefit of sub-group-based methods like the SOM.16 To understand this better, let us assume we have a SOM-based model with the following overall confusion matrix.

                 condition Pos    condition Neg
test Pos         78               1                 0.9873   ppv
test Neg         145              498               0.7745   npv
                 0.350            0.998
                 sensitivity      specificity

That is, the model recognizes around 35% of all preferred outcomes. It does so on the basis of sub-groups that all satisfy the respective ECR criterion. Thus we know that the implied risk of any classification is very low too. In other words, such models recognize whether it is allowed to apply them. If we apply them and get a positive answer, we also know that it is justified to apply them. Once the model identifies a preferred outcome, it does so without risk. This lets us miss opportunities, but we won’t be trapped by false expectations. Such models we could call auto-consistent.

In a practical project aiming at an improvement of the post-surgery risk classification of patients (n>12’000) in a hospital, we have been able to demonstrate that the achievable validated rate of implied risk can be as low as <10e-4. [15] Such a low rate is not achievable by statistical methods, simply because there are far too few incidents of wrong classifications. The subjective cut-off points in logistic regression are not quite suitable for such tasks.

At the same time, and that’s probably even more important, we get a suitable segmentation of the observations. All observations that can be identified as positive do not suffer from any risk. Thus, we can investigate the structure of the data for these observations, e.g. as particular relationships between variables, such as correlations etc. But, hey, that job is already done by the selection of the appropriate set of variables! In other words, we not only have a good model, we also have found the best possibility for a multi-variate reduction of noise, with a full consideration of the dependencies between variables. Such models can be conceived as reversed factorial experimental design.

The property of auto-consistency offers a further benefit, as it is scalable; that is, “auto-consistent” is not a categorical, or symbolic, assignment. It can easily be measured as the sensitivity under the condition of specificity > 1-ε, ε→0. Thus, we may use it as a random measure (it can be described by its density) or as a scale of reference in case of any selection task among sub-populations of models. Additionally, if the exploration of the model space does not succeed in finding a model of a suitable degree of auto-consistency, we may conclude that the quality of the data is not sufficient. Data quality is a function of properly selected variables (predictors) and reproducible measurement. We know of no other approach that would be able to inform about the quality of the data without referring to extensive contextual “knowledge”. Needless to say, such knowledge is never available and encodable.
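A sketch of that measure for score-based classifiers (data, model output and the threshold search are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import roc_curve

def auto_consistency(y_true, scores, eps=1e-3):
    """Sensitivity achievable under the constraint specificity >= 1 - eps."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    ok = fpr <= eps                    # i.e. specificity >= 1 - eps
    return tpr[ok].max() if ok.any() else 0.0

rng = np.random.default_rng(6)
y = rng.integers(0, 2, size=5000)
scores = y * 1.5 + rng.normal(size=5000)   # stand-in for a model's output scores
print(auto_consistency(y, scores, eps=1e-3))
```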

There are only weak conditions that need to be satisfied. For instance, the same selection of variables needs to be used within a single model for all similarity considerations. This rules out all ensemble methods, as far as different selections of variables are used for each item in the ensemble, for instance decision tree methods (a SOM with its sub-groups is already “ensemble-like”, yet all sub-groups are affected by the same selection of variables). It is further required to use a method that performs the transition from extensions to intensions on the sub-group level, which rules out analytic methods, and even Artificial Neural Networks (ANN); the way to establish auto-consistent models is simply not accessible to an ANN. Further, the error-cost ratio must be set before calculating the model, and the models have to be calculated explicitly, which removes linear methods from the list, such as Support Vector Machines with linear kernels (regression, ANN, Bayes). If we want to access the rich harvest of auto-consistent models we have to drop the independence hypothesis and we have to refute any kind of universalism. But these costs are rather low, indeed.

Observations and Probabilities

Here we have developed a particular perspective on the transition from observations to intensional representations. There are, of course, some interesting relationships between our point of view and the various possibilities of “interpreting” probability (see [16] for a comprehensive list of “interpretations” and interesting references). We also provide a new answer to Hume’s problem of induction.

Hume posed the question of how often we should observe a fact until we could consider it as lawful. This question, called the “problem of induction”, points in the wrong direction and will trigger only irrelevant answers. Hume, still living in times of absolute monarchism, in a society deeply structured by religious beliefs, established a short-cut between the frequency of an observation and its propositional representation. The actual question, however, is how to achieve what we call an “observation”.

In very simple, almost artificial cases like the die there is nothing to interpret. The die and its values are already symbols. It is in some way inadequate to conceive of a die or of dicing as an empirical issue. In fact, we know before what could happen. The universe of the die consists of precisely 6 singular points.

Another extreme are so-called single-case observations of structurally rich events, or processes. An event, or a setting, should be called structurally rich if there are (1) many different outcomes, and (2) many possible assignates to describe the event or the process. Such events or processes will not produce any outcome that could be expected by symbolic or formal considerations. Obviously, it is not possible to assign a relative frequency to a unique, a singular, or a non-repeatable event. Unfortunately, however, as Hájek points out [16], any actual sequence can be conceived of as a singular event.

The important point now is that single-case observations are also not sufficiently describable as an empirical issue. Ascribing propensities to objects-in-the-world demands a wealth of modeling activities and classifications, which have to be completed apriori to the observation under scrutiny. So-called single-case propensities are not a problem of probabilistic theory, but one of the application of intensional classes and their usage as means for organizing one’s own expectations. As we said earlier, probability as it is used in probability theory is not a concept that could be applied meaningfully to observations, where observations are conceived of as primitive “givens”. Probabilities are meaningful only in the closed world of available, subjectively held concepts.

We thus have to distinguish between two areas of application for the concept of probability: the observational part, where we build up classes, and the anticipatory part, where we are interested in a match of expectations and actual outcomes. The problem obviously arises by mixing them through the notion of causality.17 Yet, there is absolutely no necessary link between the two areas. The concept of risk probably allows for a resolution of the problems, since risk always implies a preceding choice of a cost function, which necessarily is subjective. Yet, the cost function and the risk implied by a classification model are also the pivot for any kind of negotiation, whether this takes place on a material, hence evolutionary scale, or within a societal context.

The interesting, if not salient point is that the subjectively available intensional descriptions and classes are dependent on one’s risk attitude. We may observe the same thing only if we have acquired the same system of related classes and the same habits of using them. Only if we apply extreme risk aversion will we achieve a common understanding about facts (in the Wittgensteinian sense, see above). This then is called science, for instance. Yet, it still remains a misunderstanding to equate this common understanding with objects as objects-out-there.

The problem of induction thus must be considered as a seriously ill-posed problem. It is a problem only for idealists (who then solve it in a weird way), or for realists that are naive about the epistemological conditions of acting in the world. Our proposal for the transition from observations to descriptions is based on probabilism on both sides, yet on either side there is a distinct flavor of probabilism.

Finally, a methodological remark shall be allowed, closely related to what we already described in the section about “noise” above. The perspective onto “making experience” that we have been proposing here demonstrates a significant twist.

Above we already mentioned Alan Hájek’s diagnosis that the frequentist and the Bayesian interpretations of probability suffer from the reference class problem. In this section we extended Hájek’s concerns to the concept of propensity. Yet, if the problem shows such a high prevalence we should not conceive it as a hurdle but should try to treat it dynamically as a rule. The reference class is only a problem as long as (1) either the actual class is required as an external constant, or (2) the abstract concept of the class is treated as a fixed point. According to the rule of Lagrange-Deleuze, any constant can be rewritten into a procedure (read: rules) and less problematic constants. Constants, or fixed points, on a higher abstract level are less problematic, because the empirically grounded semantics vanishes.

Indeed, the problem of the reference class simply disappears if we put the concept of the class, together with all the related issues of modeling, as the embedding frame, the condition under which any notion of probability can make sense at all. The classes themselves are results of “rule-following”, which admittedly is blind, but whose parameters are also transparently accessible. In this way, probabilistic interpretation is always performed in a universe that is closed and in principle fully mapped. We need the probabilistic methods just because that universe is of a huge size. In other words, the space of models is a Laplacean Universe.

Since statistical methods and similar interpretations of probability are analytical techniques, our proposal for a re-positioning of statistics into such a Laplacean Universe is also well aligned with the general habit of Wittgenstein’s philosophy, which puts practiced logic (quasi-logic) second to performance.

The disappearance of the reference class problem should be expected if our relations to the world are always mediated through the activity with abstract, epistemic modeling. The usage of probability theory as a “conceptual game” aiming for sharing diverging attitudes towards risks appears as nothing else than just a particular style of modeling, though admittedly one that offers a reasonable rate of success.

The Result of Modeling

It should be clear by now that the result of modeling is much more than just a single predictive model. Regardless of whether we take the scientific perspective or a philosophical vantage point, we need to include operationalizations of the conditions of the model that reach beyond the standard empirical risk expressed as “false classification”. Appropriate modeling provides not only a set of models with well-estimated stability and of different structures; a further goal is to establish models that are auto-consistent.

If the modeling employs a method that exposes its parameters, we even can avoid the „method hell“, that is, the results are not only reliable, they are also valid.

It is clear that only auto-consistent models are useful for drawing conclusions and for building up experience. If variables are just weighted without actually being removed, as for instance in approaches like Support Vector Machines, the resulting models are not auto-consistent. Hence, there is no way towards a propositional description of the observed process.

Given the population of explicitly tested models it is also possible to describe the differential contribution of any variable to the predictive power of a model. The assumption of neutrality or symmetry of that contribution, as it is for instance applied in statistical learning, is a simplistic perspective onto the variables and the system represented by them.

Conclusion

In this essay we described some technical aspects of the capability to experience. These technical aspects link the possibility for experience to the primacy of interpretation, which gets actualized as the techné of anticipatory, i.e. predictive or diagnostic, modeling. This techné does not address the creation or derivation of a particular model by means of employing one or several methods; the process of building a model could be fully automated anyway. Quite differently, it focuses on the parametrization, validation, evaluation and application of models, particularly with respect to the task of extracting a rule from observational data. This extraction of rules must not be conceived as a “drawing of conclusions” guided by logic. It is a constructive activity.

The salient topics in this practice are the selection of models and the description of the classifiers. We emphasized that the goal of modeling should not be conceived as the task of finding a single best model.

Methods like the Self-organizing Map, which are based on a sub-group segmentation of the data, can be used to create auto-consistent models, which also represent an optimally de-noised subset of the measured data. This data sample could be conceived as if it had been found by a factorial experimental design. Thus, auto-consistent models also provide quite valuable hints for the setup of the Taguchi method of quality assurance, which could be seen as a precipitation of organizational experience.

In the context of exploratory investigation of observational data one first has to determine the suitable observables (variables, predictors) and, by means of the same model(s), the suitable segment of observations before drawing domain-specific conclusions. Such conclusions are often expressed as contrasts in location or variation. In the context of designed experiments as e.g. in pharmaceutical research one first has to check the quality of the data, then to de-noise the data by removing outliers by means of the same data segmentation technique, before again null hypotheses about expected contrasts could be tested.

As such, auto-consistent models provide a perfect basis for learning and for extending the “experience” of an epistemic individual. According to our proposals this experience does not suffer from the various problems of traditional Humean empiricism (the induction problem), or of contemporary (defective) theories of probabilism (mainly the problem of reference classes). Nevertheless, our approach remains fully empirico-epistemological.

Notes

1. As many other philosophers, Lyotard emphasized the indisputability of an attention for the incidental, not as a perception-as, but as an aisthesis, as a forming impression. See: Dieter Mersch, ›Geschieht es?‹ Ereignisdenken bei Derrida und Lyotard. available online, last accessed May 1st, 2012. Another recent source arguing in the same direction is John McDowell’s “Mind and World” (1996).

2. The label “representationalism” has been used by Dreyfus in his critique of symbolic AI, the thesis of the “computational mind” and any similar approach that assumes (1) that the meaning of symbols is given by their reference to objects, and (2) that this meaning is independent of actual thoughts, see also [2].

3. It would be inadequate to represent such a two-fold “almost” dichotomy as a 2-axis coordinate system, even if such a representation would be a metaphorical one only; rather, it should be conceived as a tetrahedral space, given by two vectors passing nearby without intersecting each other. Additionally, the structure of that space must not be expected to be flat; it looks much more like an inhomogeneous hyperbolic space.

4. “Experiencibility” here not understood as an individual capability to witness or receptivity, but as the abstract possibility to experience.

5. In the same way we reject Husserl’s phenomenology. Phenomena, much like the objects of positivism or the thing-as-such of idealism, are not “out there”, they are results of our experiencibility. Of course, we do not deny that there is a materiality that is independent from our epistemic acts, but that does not explain or describe anything. In other words, we propose to go subjective (see also [3]).

6. Again, mechanism here should not be misunderstood as a single deterministic process as it could be represented by a (trivial) machine.

7. This question refers to the famous passage in the Tractatus that “The world is everything that is the case.“ Cases, in the terminology of the Tractatus, are facts as the existence of states of affairs. We may say, there are certain relations. In the Tractatus, Wittgenstein excluded relations that could not be explicated by the use of symbols, as expressed by the 7th proposition: „Whereof one cannot speak, thereof one must be silent.“

8. We must step outside of language in order to see the working of language.

9. We just have to repeat it again, since many people develop misunderstandings here. We do not deny the material aspects of the world.

10. “Individual” is quite misleading here, since our brain and even our mind is not in-divisible in the atomistic sense.

11. thus, it is also not reasonable to claim the existence of a somehow dualistic language, one part being without ambiguities and vagueness, the other one establishing ambiguity deliberately by means of metaphors. Lakoff & Johnson started from a similar idea, yet they developed it into a direction that is fundamentally incompatible with our views in many ways.

12. Of course, the borders are not well defined here.

13. “predictive power” could be operationalized in quite different ways, of course….

14. Correlational analysis is not a candidate to resolve this problem, since it can’t be used to segment the data or to identify groups in the data. Correlational analysis should be performed only subsequent to a segmentation of the data.

15. The so-called genetic algorithms are not algorithms in the narrow sense, since there is no well-defined stopping rule.

16. It is important to recognize that Artificial Neural Networks are NOT belonging to the family of sub-group based methods.

17. Here another circle closes: the concept of causality can’t be used in a meaningful way without considering its close amalgamation with the concept of information, as we argued here. For this reason, Judea Pearl’s approach towards causality [17] is seriously defective, because he completely neglects the epistemic issue of information.

References
  • [1] Geoffrey C. Bowker, Susan Leigh Star. Sorting Things Out: Classification and Its Consequences. MIT Press, Boston 1999.
  • [2] William Croft, Esther J. Wood, Construal operations in linguistics and artificial intelligence. in: Liliana Albertazzi (ed.), Meaning and Cognition. Benjamins Publ, Amsterdam 2000.
  • [3] Wilhelm Vossenkuhl. Solipsismus und Sprachkritik. Beiträge zu Wittgenstein. Parerga, Berlin 2009.
  • [4] Douglas Hofstadter, Fluid Concepts And Creative Analogies: Computer Models Of The Fundamental Mechanisms Of Thought. Basic Books, New York 1996.
  • [5] Nicholas F. Gier, Wittgenstein and Deconstruction, Review of Contemporary Philosophy 6 (2007); first publ. in Nov 1989. Online available.
  • [6] Henk L. Mulder, B.F.B. van de Velde-Schlick (eds.), Moritz Schlick, Philosophical Papers, Volume II: (1925-1936), Series: Vienna Circle Collection, Vol. 11b, Springer, Berlin New York 1979. with Google Books
  • [7] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee & Partha Niyogi (2004). General conditions for predictivity in learning theory. Nature 428, 419-422.
  • [8]  Vladimir Vapnik, The Nature of Statistical Learning Theory (Information Science and Statistics). Springer 2000.
  • [9] Herman J. Bierens (2006). Information Criteria and Model Selection. Lecture notes, mimeo, Pennsylvania State University. available online.
  • [10] Brian Weatherson (2007). The Bayesian and the Dogmatist. Aristotelian Society Vol. 107, Issue 1pt2, 169–185. draft available online
  • [11] Edward I. George (2000). The Variable Selection Problem. J Am Stat Assoc, Vol. 95 (452), pp. 1304-1308. available online, as research paper.
  • [12] Alan Hájek (2007). The Reference Class Problem is Your Problem Too. Synthese 156(3): 563-585. draft available online.
  • [13] Lori E. Dodd, Margaret S. Pepe (2003). Partial AUC Estimation and Regression. Biometrics 59(3), 614–623.
  • [14] Zytkow J. (1997). Knowledge=concepts: a harmful equation. 3rd Conference on Knowledge Discovery in Databases, Proceedings of KDD-97, p.104-109. AAAI Press.
  • [15] Thomas Kaufmann, Klaus Wassermann, Guido Schüpfer (2007). Beta error free risk identification based on SPELA, a neuro-evolution method. Presented at ESA 2007.
  • [16] Alan Hájek, “Interpretations of Probability”, The Stanford Encyclopedia of Philosophy (Summer 2012 Edition), Edward N. Zalta (ed.), available online, or forthcoming.
  • [17] Judea Pearl, Causality – Models, Reasoning, and Inference. 2nd ed. Cambridge University Press, Cambridge 2008 [2000].

۞

Ideas and Machinic Platonism

March 1, 2012 § Leave a comment

Once the cat had the idea to go on a journey…
You don’t believe me? Did not your cat have the same idea? Or is your doubt about my belief that cats can have ideas?

So, look at this individual here, who is climbing along the facade, outside the window…

(sorry for the spoken comment being available only in German language in the clip, but I am quite sure you got the point anyway…)

Cats definitely know about the height of their own position, and this one is climbing from flat to flat … outside, on the facade of the building, and on the 6th floor. Crazy, or cool, respectively, in the full meaning of those words, this cat here, since it looks like she has been having a plan… (of course, anyone who has ever lived together with a cat knows very well that they can have plans… pride like this one here, and also remorse…)

Yet, what would your doubts look like if I said “Once the machine got the idea…”? Probably you would stop talking or listening to me, turning away from this strange guy. Anyway, just that is the claim here, and hence I hope you keep reading.

We already discussed elsewhere1 that it is quite easy to derive a bunch of hypotheses about empirical data. Yet, deriving regularities or rules from empirical data does not make up an idea, or a concept. At most they could serve as a kind of qualified precursor for the latter. Once the subject of interest has been identified, deriving hypotheses about it is almost something mechanical. Ideas, and concepts as well, are much more related to the invention of a problematics, as Deleuze has been working out again and again, without being that invention or problematics themselves. To overlook (or to negate?) that difference between the problematic and the question is one of the main failures of logical empiricism, and probably even of today’s science.

The Topic

But what is it then that would make up an idea, or a concept? Douglas Hofstadter once wrote [1] that we are lacking a concept of concept. Since then, a discipline has emerged that calls itself “formal concept analysis”. So, actually some people indeed do think that concepts could be analyzed formally. We will see that the issues about the relation between concepts and form are quite important. We already met some aspects of that relationship in the chapters about formalization and creativity. And we definitely think that formalization expels anything interesting from what probably had been a concept before that formalization. Of course, formalization is an important part of thinking, yet its importance is restricted: before it there are concepts, and after it we have reduced them into a fixed set of finite rules.

Ideas

Ideas are almost annoying, I mean, as a philosophical concept, and they have been so since the first clear expressions of philosophy. From the very beginning there was a quarrel not only about “where they come from,” but also about their role with respect to knowledge, today expressed as . Very early on in philosophy two seemingly juxtaposed positions emerged, represented by the philosophical approaches of Platon and Aristotle. The former claimed that ideas are before perception, while for the latter ideas clearly have been assigned the status of something derived, secondary. Yet, recent research emphasized the possibility that the contrast between them is not as strong as it has been proposed for more than 2000 years. There is an eminent empiric pillar in Platon’s philosophical building [2].

We certainly will not delve into this discussion here; it simply would take too much space and effort, and not least there are enough sources on the web displaying the traditional positions in great detail. Throughout history since Aristotle, many and rather divergent flavors of idealism emerged. Whatever the exact distinctive claim of any of those positions is, they all share the belief in the dominance of some top-down principle as an essential part of the conditions for the possibility of knowledge, or more generally the episteme. Some philosophers like Hegel or Frege, just as others nowadays perceived as members of German Idealism, took rather radical positions. Frege’s hyper-platonism, probably the most extreme idealistic position (but not exceeding Hegel’s “great spirit” that far), indeed claimed that something like a triangle exists, and quite literally so, albeit in a non-substantial manner, completely independent from any, e.g. human, thought.

Let us fix this main property of the claim of a top-down principle as characteristic for any flavor of idealism. The decisive question then is how we could think the becoming of ideas. It is clearly one of the weaknesses of idealistic positions that they induce a salient vulnerability regarding the issue of justification. As a philosophical structure, idealism mixes content with value in the structural domain, consequently and quite directly leading to a certain kind of blind spot: political power is justified by the right idea. The factual consequences have been disastrous throughout history.

So, there are several alternatives for thinking about this becoming. But even before we consider any alternative, it should be clear that something like “becoming” and “idealism” are barely compatible. Maybe a very soft idealism, one that already turned into pragmatism, much in the vein of Charles S. Peirce, could allow us to think process and ideas together. Hegel’s position, or likewise Schelling’s, Fichte’s, Marx’s or Frege’s, definitely excludes any such rapprochement or convergence.

The becoming of ideas cannot be thought of as something that is flowing down from even greater transcendental heights. Of course, anybody may choose to invoke some kind of divinity here, but obviously that does not help much. A solution according to Hegel’s great spirit, history itself, is not helpful either, even as this concept implied that there is something in and about the community that is indispensable when it comes to thinking. Much later, Wittgenstein took a related route and thereby initiated the momentum towards the linguistic turn. Yet, Hegel’s history is not useful for getting clear about the becoming of ideas regarding the involved mechanism. And without such mechanisms anything like machine-based episteme, or cats having ideas, is accepted as being impossible apriori.

One such mechanism is interpretation. For us the principle of the primacy of interpretation is definitely indisputable. This does not mean that we disregard the concept of the idea, yet we clearly take an Aristotelian position. More à jour, we could say that we are quite fond of Deleuze’s position on relating empiric impressions, affects, and thought. There are, of course, many supporters in the period of time that spans between Aristotle and Deleuze who are quite influential for our position.2
Yet, somehow it all culminated in the approach that has been labelled French philosophy, which for us comprises mainly Michel Serres, Gilles Deleuze and Michel Foucault, with some predecessors like Gilbert Simondon. They converged towards a position that allows us to think the embedding of ideas in the world as a process, or as an ongoing event [3,4], and this embedding is based on empiric affects.

So far, so good. Yet, we only declared the kind of raft we will build to sail with. We didn’t mention anything about how to build this raft or how to sail it. Before we can start to constructively discuss the relation between machines and ideas we first have to visit the concept, both as an issue and as a concept.

Concepts

“Concept” is a very special concept. First, it is not externalizable, which is why we call it a strongly singular term. Whenever one thinks “concept”, there is already something like a concept. For most of the other terms in our languages, such as idea, that does not hold. Thus, and regarding the structural dynamics of its usage, “concept” behaves similarly to “language” or “formalization”.

Additionally, however, “concept” is not a self-containing term like “language”. One needs not only symbols, one even needs a combination of categories and structured expression; there are also Peircean signs involved, and last but not least concepts relate to models, even as models are also quite apart from them. Ideas do not relate to models in the same way as concepts do.

Let us, for instance, take the concept of time. There is this abundantly cited quote by Augustine [5], a passage where he tries to explain the status of God as the creator of time, hence the fundamental incomprehensibility of God, and even of his creations (such as time) [my emphasis]:

For what is time? Who can easily and briefly explain it? Who even in thought can comprehend it, even to the pronouncing of a word concerning it? But what in speaking do we refer to more familiarly and knowingly than time? And certainly we understand when we speak of it; we understand also when we hear it spoken of by another. What, then, is time? If no one ask of me, I know; if I wish to explain to him who asks, I know not. Yet I say with confidence, that I know that if nothing passed away, there would not be past time; and if nothing were coming, there would not be future time; and if nothing were, there would not be present time.

I certainly don’t want to speculate about “time” (or God) here; instead I would like to focus on this peculiarity Augustine is talking about. Many, and probably even Augustine himself, confine this peculiarity to time (and space). I think, however, this peculiarity applies to any concept.

By means of this example we can quite clearly experience the difference between ideas and concepts. Ideas are some kind of models—we will return to that in the next section—, while concepts are both the condition for models and conditioned by models. The concept of time provides the condition for calendars, which in turn can be conceived as a possible condition for the operationalization of expectability.

“Concepts” as well as “models” do not exist as “pure” forms. We elicit a strange and eminently counter-intuitive force when trying to “think” pure concepts or models. The stronger we try, the more we imply their “opposite”, which in the case of concepts presumably is the embedding potentiality of mechanisms, and in the case of models we could say it is simply belief. We will discuss these relations in much more detail in the chapter about the choreosteme (forthcoming). Actually, we think that it is appropriate to conceive of terms like “concept” and “model” as choreostemic singular terms, or short, choreostemic singularities.

Even from an ontological perspective we could not claim that there “is” such a thing like a “concept”. Well, you may already know that we refute any ontological approach anyway. Yet, in the case of choreostemic singular terms like “concept” we can’t simply resort to our beloved language game. With respect to language, the choreosteme takes the role of an apriori, something like the sum of all conditions.

Since we would need a full discussion of the concept of the choreosteme, we can’t fully discuss the concept of “concept” here. Yet, as kind of a summary we may propose that the important point about concepts is that they are nothing that could exist. A concept does not exist as matter, as information, as substance, nor as form.

The language game of “concept” simply points into the direction of that non-existence. Concepts are not a “thing” that we could analyze, and also nothing that we could relate to by means of an identifiable relation (as e.g. in a graph). Concepts are best taken as a gradient field in a choreostemic space, yet one exhibiting a quite unusual structure and topology. So far, we identified two (of a total of four) singularities that together spawn the choreostemic space. We also could say that the language game of “concept” is used to indicate a certain form of drift in the choreostemic space. (Later we also will discuss the topology of that space, among many other issues.)

For our concerns here in this chapter, the machine-based episteme, we can conclude that it would be a misguided approach to try to implement concepts (or their formal analysis). The issue of the conditions for the ability to move around in the choreostemic space we have to postpone. In other words, we have confined our task, or at least we have found a suitable entry point for our task, the investigation of the relation between machines and ideas.

Machines and Ideas

When talking about machines and ideas we are, here and for the time being, not interested in the usage of machines to support “having” ideas. We are not interested in such tooling for now. The question is about the mechanism inside the machine that would lead to the emergence of ideas.

Think about the idea of a triangle. Certainly, triangles as we imagine them do not belong to the material world. Any possible factual representation is imperfect, as compared with the idea. Yet, without the idea (of the triangle) we wouldn’t be able to proceed, for instance, towards land surveying. As already said, ideas serve as models; they do not involve formalization, yet they often live as formalization (though not always a mathematical one) in the sense of an idealized model; in other words, they serve as ladder spokes for actions. Concepts, if we contrast them to ideas, that is, if we try to distinguish them, never could be formalized; they remain inaccessible as condition. Nothing else could be expected from a transcendental singularity.

Back to our triangle. Even though we can’t represent them perfectly, seeing a lot of imperfect triangles gives rise to the idea of the triangle. Rephrased in this way, we may recognize that the first half of the task is to look for a process that would provide an idealization (of a model), starting from empirical impressions. The second half of the task is to get the idea working as a kind of template, yet not as a template. Such an abstract pattern is detached from any direct empirical relation, despite the fact that we once started with empiric data.

Table 1: The two tasks in realizing “machinic idealism”

Task 1: process of idealization that starts with an intensional description
Task 2: applying the idealization for first-of-a-kind-encounters

Here we should note that culture is almost defined by the fact that it provides such ideas before any individual person’s possibility to collect enough experience for deriving them on her own.

In order to approach these tasks, we need first model systems that exhibit the desired behavior, but which also are simple enough to comprehend. Let us first deal with the first half of the task.

Task 1: The Process of Idealization

We already mentioned that we need to start from empirical impressions. These can be provided by the Self-organizing Map (SOM), as it is able to abstract from the list of observations (the extensions), thereby building an intensional representation of the data. In other words, the SOM is able to create “representative” classes. Of course, these representations are dependent on some parameters, but that’s not the important point here.

Once we have those intensions available, we may ask how to proceed in order to arrive at something that we could call an idea. Our proposal for an appropriate model system consists of the following parts:

  • (1) A small set (n=4) of profiles, which consist of 3 properties; the form of the profiles is set apriori such that they overlap partially;
  • (2) a small SOM, here with 12×12=144 nodes; the SOM needs to be trainable and also should provide classification service, i.e. acting as a model
  • (3) a simple Monte-Carlo-simulation device, that is able to create randomly varied profiles that deviate from the original ones without departing too much;
  • (4) A measurement process that is recording the (simulated) data flow

The profiles are defined as shown in the following table (V denotes variables, C denotes categories, or classes):

     V1    V2    V3
C1   0.1   0.4   0.6
C2   0.8   0.4   0.6
C3   0.3   0.1   0.4
C4   0.2   0.2   0.8

From these parts we then build a cyclic process, which comprises the following steps.

  • (0) Organize some empirical measurement for training the SOM; in our model system, however, we use the original profiles and create an artificial body of “original” data, in order to be able to detect the relevant phenomenon (we have perfect knowledge about the measurement);
  • (1) Train the SOM;
  • (2) Check the intensional descriptions for their implied risk (which should be minimal, i.e. below some threshold) and extract them as profiles;
  • (3) Use these profiles to create a bunch of simulated (artificial) data;
  • (4) Take the profile definitions and simulate enough records to train the SOM again, returning to step (1).

Thus, we have two counteracting forces: (1) a dispersion due to the randomizing simulation, and (2) the focusing of the SOM due to the filtering along the separability, in our case operationalized as risk (1/ppv, ppv = positive predictive value) per node. Note that the SOM process is not a directly re-entrant process as for instance Elman networks are [6,7,8].4
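The following toy sketch reproduces the cycle just described with a small SOM written from scratch, using the profile table from above. It is a minimal, simplifying illustration: the noise level, the node-purity threshold standing in for the ppv-based risk filter, and all names are our own assumptions, not the original software.

```python
# Toy re-entrant SOM process: simulate -> train -> extract low-risk intensions
# -> simulate from those intensions -> train again, and so on.
import numpy as np

rng = np.random.default_rng(0)

profiles = np.array([[0.1, 0.4, 0.6],   # C1
                     [0.8, 0.4, 0.6],   # C2
                     [0.3, 0.1, 0.4],   # C3
                     [0.2, 0.2, 0.8]])  # C4

def simulate(protos, n_total=800, noise=0.08):
    """Monte-Carlo device: randomly varied records around each prototype."""
    per_class = max(1, n_total // len(protos))
    data = [p + rng.normal(0.0, noise, size=(per_class, p.size)) for p in protos]
    labels = np.repeat(np.arange(len(protos)), per_class)
    return np.vstack(data), labels

def train_som(data, side=12, epochs=8, lr=0.3, sigma=2.0):
    """Minimal online SOM; returns the codebook, shape (side*side, dim)."""
    grid = np.array([(i, j) for i in range(side) for j in range(side)], float)
    w = rng.random((side * side, data.shape[1]))
    for _ in range(epochs):
        for x in rng.permutation(data):
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))
            h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            w += lr * h[:, None] * (x - w)
    return w

def intensions(w, data, labels, purity_min=0.9):
    """Field-wise averages of the extensions per node, kept only if the node
    is 'pure' enough (our stand-in for the ppv / risk criterion)."""
    bmus = np.array([np.argmin(((w - x) ** 2).sum(axis=1)) for x in data])
    kept = []
    for node in np.unique(bmus):
        ext, lab = data[bmus == node], labels[bmus == node]
        if np.bincount(lab).max() / lab.size >= purity_min:
            kept.append(ext.mean(axis=0))
    return np.array(kept)

# the re-entrant cycle; the prototypes drift away from the empiric profiles
protos = profiles
for cycle in range(5):
    data, labels = simulate(protos)
    codebook = train_som(data)
    protos = intensions(codebook, data, labels)
    print(f"cycle {cycle}: {len(protos)} idealized intensions")
    if len(protos) == 0:
        break
```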

This process leads not only to a focusing contrast-enhancement but also to (a limited version of) inventing new intensional descriptions that have never been present in the empiric measurement, at least not saliently enough to show up as an intension.

The following figure 1a-1i shows 9 snapshots from the evolution of such a system; it starts at the top-left of the portfolio, then proceeds row-wise from left to right down to the bottom-right item. Each of the 9 items displays a SOM, where the RGB-color corresponds to the three variables V1, V2, V3. A particular color thus represents a particular profile on the level of the intension. Remember that the intensions are built from the field-wise average across all the extensions collected by a particular node.

Well, let us now contemplate a bit about the sequence of these panels, which represents the evolution of the system. The first point is that there is no particular locational stability. Of course not, I am tempted to say, since a SOM is not an image that represents as an image. A SOM contains intensions and abstractions; the only issue that counts is its predictive power.

Now, comparing the colors between the first and the second, we see that the green (top-right in 1a, middle-left in 1b) and the brownish (top-left in 1a, middle-right in 1b) appear much more clear in 1b as compared to 1a. In 1a, the green obviously was “contaminated” by blue, and actually by all other values as well, leading to its brightness. This tendency prevails. In 1c and 1d yellowish colors are separated, etc.

Figure 1a thru 1i: A simple SOM in a re-entrant Markov process develops idealization. Time index proceeds from top-left to bottom-right.

The point now is that the intensions contained in the last SOM (1i, bottom-right of the portfolio) have not been recognizable in the beginning; in some important respect they have not been present at all. Our SOM steadily drifted away from its empirical roots. That’s not a big surprise, indeed, for we used a randomization process. The nice thing is something different: the intensions get “purified”, changing thereby their status from “intensions” to “ideas”.

Now imagine that the variables V1..Vn represent properties of geometric primitives. Our sensory apparatus is able to perceive and to encode them: horizontal lines, vertical lines, crossings, etc. In empiric data our visual apparatus may find any combination of those properties, especially in case of a (platonic) school (say: academia) where the pupils and the teachers draw triangles over triangles into the wax tablets, or into the sand of the pathways in the garden…

By now, the message should be quite clear: there is nothing special about ideas. In abstract terms, what is needed is

  • (1) a SOM-like structure;
  • (2) a self-directed simulation process;
  • (3) re-entrant modeling

Notice that we need not specify a target variable. The associative process itself is just sufficient.

Given this model it should not surprise us anymore why the first philosophers came up with idealism. It is almost built into the nature of the brain. We may summarize our achievements in the following characterization:

Ideas can be conceived as idealizations of intensional descriptions.

It is of course important to be aware of the status of such a “definition”. First, we tried to separate concepts and ideas. Most of the literature about ideas conflates them. Yet, as long as they are conflated, everything and any reasoning about mental affairs, cognition, thinking and knowledge necessarily remains inappropriate. For instance, the infamous discourse about universals and qualia seriously suffered from that conflation, or more precisely, it only arose due to that mess.

Second, our lemma is just an operationalization, despite the fact that we are quite convinced about its reasonability. Yet, there might be different ones.

Our proposal has important benefits though, as it matches a lot of the aspects commonly associated with the term “idea.” In my opinion, what is especially striking about the proposed model is the observation that idealization implicitly also led to the “invention” of “intensions” that were not present in the empiric data. Who would have expected that idealization is implicitly inventive?

Finally, two small notes should be added concerning the type of data and the status of the “idea” as a continuously intermediate result of the re-entrant SOM process. One should be aware that the “normal” input to natural associative systems are time series. Our brain is dealing with a manifold of series of events, which is mapped onto the internal processes, that is, onto another time-based structure. Prima facie, our brain is not dealing with tables. Yet, (virtual) tabular structures are implied by the process of propertization, which is an inevitable component of any kind of modeling. It is well-known that it is time-series data and their modeling that give rise to the impression of causality. In the light of ideas qua re-entrant associativity, we now can easily understand the transition from networks of potential causal influences to the claim of “causality” as some kind of a pure concept. Even though the idea of causality (in the Newtonian sense) played an important role in the history of science, it is just that: a naive idealization.

The other note concerns the source of the data. If we consider re-entrant informational structures that are arranged across large “distances”, possibly with several intermediate transformative complexes (for which there are hints from neurobiology), we may understand that for a particular SOM (or SOM-like structure) the type of the source is completely opaque. To put it short, it does not matter for our proposed mechanism whether the data are sourced as empiric data from the external world, or as some kind of simulated, surrogated re-entrant data from within the system itself. In such wide-area, informationally re-entrant probabilistic networks we may expect a kind of runaway idealization. The question then is about the minimal size necessary for eliciting that effect. A nice corollary of this result is the insight that logistic networks, such as the internet or the telephone cabling, will NEVER start to think on their own, as some still expect. Yet, since there are a lot of brains as intermediate transforming entities embedded in this deterministic cablework, we indeed may expect that the whole assembly is much more than could be achieved by a small group of humans living, say, around 1983. But that is not really a surprise.

Task 2: Ideas, applied

Ideas are an extremely important structural phenomenon, because they allow us to recognize things and to deal with tasks that we have never seen before. We may act adaptively before having encountered a situation that would directly resemble—as an equivalence class—any intensional description available so far.

Actually, it is not just one idea, it is a “system” of ideas that is needed for that. Some years ago, Douglas Hofstadter and his group3 devised a model system suitable for demonstrating exactly this: the application of ideas. They called the project (and the model system) Copycat.

We won’t discuss Copycat and analogy-making as ruled by top-down ideas here (we already introduced it elsewhere). We just want to note that the central “platonic” concept in Copycat is a dynamic relational system of symmetry relations. Such symmetry relations are for instance “before”, “after”, or “builds a group”, “is a triple”, etc. Such kinds of relations represent different levels of abstraction, but that’s not important here. Much more important is the fact that the relations between these symmetry relations are dynamic and will adapt according to the situation at hand.

I think that these symmetry relations as conceived by the Fargonauts are on the same level as our ideas. The transition from ideas to symmetries is just a grammatological move.

The case of Biological Neural Systems

Re-entrance seems to be an important property of natural neural networks. Very early in the liaison of neurobiology and computer science, beginning with Hebb in the 1940s and later Hopfield, recurrent networks have been attractive for researchers. Take a look at drawings like the following, created (!) by Ramon y Cajal [10] at the beginning of the 20th century.

Figure 2a-2c: Drawings by Ramon y Cajal, the Spanish neurobiologist. See also: History of Neuroscience. a: from a sparrow’s brain, b: motor area of the human brain, c: hypothalamus of the human brain

Yet, Hebb, Hopfield and Elman got trapped by the (necessary) idealization of Cajal’s drawings. Cajal’s interest was to establish and to prove the “neuron hypothesis”, i.e. that brains work on the basis of neurons. From Cajal’s drawings to the claim that biological neuronal structures could be represented by cybernetic systems or finite state machines is, honestly, a breakneck leap, or likewise, ideology.

Figure 3: Structure of an Elman Network; obviously, Elman was seriously affected by idealization (click for higher resolution).

Thus, we propose to distinguish between re-entrant and recurrent networks. While the latter are directly wired onto themselves in a deterministic manner, that is, the self-reference is modeled on the morphological level, the former are modeled on the informational level. Since it is simply impossible for a cybernetic structure to reflect neuromorphological plasticity and change, the informational approach is much more appropriate for modeling large assemblies of individual “neuronal” items (cf. [11]).

Nevertheless, the principle of re-entrance remains a very important one. It is a structure that is known to lead to contrast enhancement and to second-order memory effects. It is also a cornerstone in the theory (theories) proposed by Gerald Edelman, who probably is much less affected by cybernetics (e.g. [12]) than the authors cited above. Edelman always conceived the brain-mind as something like an abstract informational population; he even was the first to adopt evolutionary selection processes (Darwinian and others) to describe the dynamics in the brain-mind.

Conclusion: Machines and Choreostemic Drift

Our point of departure was to distinguish between ideas and concepts. Their difference becomes visible if we compare them, for instance, with regard to their relation to (abstract) models. It turns out that ideas can be conceived as a more or less stable immaterial entity (though not a “state”) of self-referential processes involving self-organizing maps and the simulated surrogates of intensional descriptions. Concepts on the other hand are described as a transcendental vector in choreostemic processes. Consequently, we may propose only for ideas that we can implement their conditions and mechanisms, while concepts can’t be implemented. It is beyond the expressibility of any technique to talk about the conditions for their actualization. Hence, the issue of “concept” has been postponed to a forthcoming chapter.

Ideas can be conceived as the effect of putting a SOM into a re-entrant context, through which the SOM develops a system of categories beyond simple intensions. These categories are not justified by empirical references any more, at least not in the strong sense. Hence, ideas can also be characterized as being clearly distinct from models or schemata. Both models and schemata involve classification, which—due to the dissolved bonds to empiric data—cannot be regarded as a sufficient component for ideas. We would like to suggest the proposed mechanism as the candidate principle for the development of ideas. We think that the simulated data in the re-entrant SOM process should be distinguished from data in contexts that are characterized by measurement of “external” objects, albeit their digestion by the SOM mechanism itself remains the same.

From what has been said it is also clear that the capability of deriving ideas alone is still quite close to the material arrangements of a body, whether thought as biological wetware or as software. Therefore, we still didn’t reach a state where we can talk about epistemic affairs. What we need is the possibility of expressing the abstract conditions of the episteme.

Of course, what we have compiled here exceeds by far any other approach, and additionally we think that it could serve as a natural complement to the work of Douglas Hofstadter. In his work, Hofstadter had to implement the platonic heavens of his machine manually, and even for the small domain he’d chosen this was tedious work. Here we proposed the possibility of a seamless transition from the world of associative mechanisms like the SOM to the world of platonic Copycats, and “seamless” here refers to “implementable”.

Yet, what is really interesting is the form of choreostemic movement or drift, resulting from a particular configuration of the dynamics in systems of ideas. But this is another story, perhaps related to Felix Guattari’s principle of the “machinic”, and it definitely can’t be implemented any more.

Notes

1. we did so in the recent chapter about data and their transformation, but also see the section “Overall Organization” in Technical Aspects of Modeling.

2. You really should be aware that this trace we try to put forward here does not come close to even a coarse outline of all of the relevant issues.

3. they called themselves the “Fargonauts”, from FARG being the acronym for “Fluid Analogy Research Group”.

4. Elman networks are an attempt to simulate neuronal networks on the level of neurons. Such approaches we rate as fundamentally misguided, deeply inspired by cybernetics [9], because they consider noise as disturbance. Actually, they are equivalent to finite state machines. It is somewhat ridiculous to consider a finite state machine as model for learning “networks”. SOM, in contrast, especially if used in architectures like ours, are fundamentally probabilistic structures that could be regarded as “feeding on noise.” Elman networks, and their predecessor, the Hopfield network are not quite useful, due to problems in scalability and, more important, also in stability.

  • [1] Douglas R. Hofstadter, Fluid Concepts And Creative Analogies: Computer Models Of The Fundamental Mechanisms Of Thought. Basic Books, New York 1996. p.365
  • [2] Gernot Böhme, “Platon der Empiriker.” in: Gernot Böhme, Dieter Mersch, Gregor Schiemann (eds.), Platon im nachmetaphysischen Zeitalter. Wissenschaftliche Buchgesellschaft, Darmstadt 2006.
  • [3] Marc Rölli (ed.), Ereignis auf Französisch: Von Bergson bis Deleuze. Fin, Frankfurt 2004.
  • [4] Gilles Deleuze, Difference and Repetition. 1967
  • [5] Augustine, Confessions, Book 11 CHAP. XIV.
  • [6] Mandic, D. & Chambers, J. (2001). Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley.
  • [7] J.L. Elman, (1990). Finding Structure in Time. Cognitive Science 14 (2): 179–211.
  • [8] Raul Rojas, Neural Networks: A Systematic Introduction. Springer, Berlin 1996. (@google books)
  • [9] Holk Cruse, Neural Networks As Cybernetic Systems: Science Briefings, 3rd edition. Thieme, Stuttgart 2007.
  • [10] Santiago R. y Cajal, Texture of the Nervous System of Man and the Vertebrates: Volume I, Springer, Wien 1999, edited and translated by Pedro Pasik & Tauba Pasik. see google books
  • [11] Florence Levy, Peter R. Krebs (2006), Cortical-Subcortical Re-Entrant Circuits and Recurrent Behaviour. Aust N Z J Psychiatry September 2006 vol. 40 no. 9 752-758.
    doi: 10.1080/j.1440-1614.2006.01879
  • [12] Gerald Edelman: “From Brain Dynamics to Consciousness: A Prelude to the Future of Brain-Based Devices“, Video, IBM Lecture on Cognitive Computing, June 2006.

۞

Data

February 28, 2012 § Leave a comment

There are good reasons to think that data appear

as the result of friendly encounters with the world.

Originally, “data” has been conceived as the “given”, or as things that are given, if we follow the etymological traces. That is not quite surprising, since it is closely related to the concept of the date as a point in time. And what, if not time, could be something that is given? The concept of the date is, on the other hand, related to computation, at least if we consider etymology again. Towards the end of the medieval ages, the problems around the calculation of the next Easter date(s) triggered the first institutionalized recordings of rule-based approaches that have been called “computation.” Already at that time it was a subject for specialists…

Yet, the cloud of issues around data also involves things. But “things” are nothing that is invariably given, so to speak as a part of an independent nature. In Nordic languages there is a highly interesting link to constructivism. Things originally denoted some early kind of parliament. The Icelandic “alþingi”, or transposed “Althingi”, is the oldest parliamentary institution in the world still extant, founded in 930. If we take this thread further, it is clear that things refer to entities that have been recognized by the community as subject to standardization. That’s the job of parliaments or councils. Said standardization comprises the name, rules for recognizing it, and rules for using or applying it, or simply, how to refer to it, e.g. as part of a semiosic process. That is, some kind of legislation, or norming, if not to say normalization. (That’s not a bad thing in itself; it becomes problematic only if a society is too eager in doing so, since standardization is a highly relevant condition for developing higher complexity, see here.) And, back to the date, we fortunately also know about a quite related usage of the “date” as in “dating”, or to make a date, in other words, to fix the (mostly friendly) issues with another person…

The wisdom of language, as Michel Serres once coined it (somewhere in his Hermes series, I suppose), knew everything, it seems. Things are not, because they remain completely beyond even any possibility to perceive them if there is no standard to treat the differential signals they provide. This “treatment” we usually call interpretation.

What we can observe here in the etymological career of “data” is nothing else than a certain relativization, a de-centering of the concept away from the absolute centers of nature, or likewise the divine. We observe nothing else than the evolution of a language game into its reflected use.

This now is just another way to abolish ontology and its existential attitude, at least as far as it claims an “independent” existence. In order to become clear about the concept of data, what we can do about it, or even how to use data, we have to arrive at a proper level of abstraction, which in itself is not a difficult thing to understand.

This, however, also means that “data processing” can’t be conceived in the way we conceive, for instance, the milling of grain. Data processing should be taken much more as a “data thinging” than as a data milling, or data mining. There is deep relativity in the concept of data, because it is always an interpretation that creates them. It is nonsense to naturalize them in the infamous equation “information = data + meaning”; we already discussed that in the chapter about information. Yet, this process probably did not reach its full completion, especially not in the discipline of so-called computer “sciences”. Well, every science started as some kind of Hermetism or craftsmanship…

Yet, one still might say that at a given point in time we come upon encoded information, we encounter some written, stored, or somehow else materially represented structured differences. Well, ok, that’s true. However, and that’s a big however: We still can NOT claim that the data is something given.

This raises a question: what are we actually doing when we say that we “process” data? At first sight, and many people think so, this processing of data produces information. But again, it is not a processing in the sense of milling. This information thing is not the result of some kind of milling. It needs constructive activities and calls for affected involvement.

Obviously, the result or the produce of processing data is more data. Data processing is thus a transformation. Probably it is appropriate to say that “data” is the language game for “transforming the possibility for interpretation into its manifold.” Nobody should wonder about the fact that there are more and more “computers” all the time and everywhere. Besides the fact that the “informationalization” of any context allows for an improved generality as well as for an improved accuracy (they excluded each other in the mechanical age), the conceptual role of data itself produces a built-in acceleration.

Let us leave the trivial aspects of digital technology behind, that is, everything that concerns mere re-arrangement and recombination without losing or adding anything. Of course, creating a pivot table may lead to new insights, since we suddenly (and simply) can relate things that we couldn’t without pivoting. Nevertheless, it is mere re-arrangement, despite being helpful, of course. Pivoting itself does not produce any insight.

Our interest is in machine-based episteme and its possibility. So, the natural question is: How to organize data and its treatment such that machine-based episteme is possible? Obviously this treatment has to be organized and developed in a completely autonomous manner.

Treating Data

In so-called data mining, which can only be considered a somewhat childish misnomer, people often report that they spend most of the time on preparing data. Up to 80% of the total project time budget is spent on “preparing data”. Nothing could render the inappropriate concepts behind data mining more visible than this fact. But one step at a time…

The input data to machine learning are often considered to be extremely diverse. In the first place, we have to distinguish between structured data and unstructured qualities like text or images; secondly, between the different scales of expression.

Table 1: Data in the Quality Domain

structured data: things like tables, or schemes, or data that could be brought into such a form in one way or another; often related to physical measurement devices or organizational issues (or habits)
unstructured data: entities that can’t, in principle, be brought into a structured form before processing them. It is impossible to extract the formal “properties” of a text before interpreting it; those properties we would have to know before being able to set up any kind of table into which we could store our “measurement”. Hence, unstructured data can’t be “measured”. Everything is created and constructed “on-the-fly”, sailing while building the raft, as Deleuze (Foucault?) put it once. Any input needs to be conceived as, and presented to the learning entity in, a probabilized form.

Table 2: Data in the Scale Domain

real-valued scale: numeric, like 1.232; mathematically: real numbers, (ir)rational numbers, etc.; infinitely many different values
ordinal scale: enumerations or orderings, limited to a rather small set of values, typically n < 20, such as 1, 2, 3, 4; mathematically: natural numbers, integers
nominal scale: singular textual tokens, such as “a”, “abc”, “word”
binary scale: only two values are used for encoding, such as 1/0, or yes/no, etc.

Often it is proposed to regard the real-valued scale as the most dense one, hence it is the scale that could be expected to transport the largest amount of information. Despite the fact that this is not always true, it surely allows for a superior way to describe the risk in modeling.
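Just as a hedged illustration (the function names, the choice of the unit interval as the common range, and the simple one-hot treatment of nominal tokens are assumptions of this sketch, not prescriptions), the four scales could be brought into a common numerical representation roughly like this:

```python
import numpy as np

def encode_real(values):
    """Real-valued scale: min-max normalization onto [0..1]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

def encode_ordinal(values):
    """Ordinal scale: map the (few) ordered levels onto [0..1]."""
    levels = sorted(set(values))
    den = max(len(levels) - 1, 1)
    return np.array([levels.index(v) / den for v in values], dtype=float)

def encode_nominal(values):
    """Nominal scale: one-hot encoding, one binary column per token."""
    tokens = sorted(set(values))
    return np.array([[1.0 if v == t else 0.0 for t in tokens] for v in values]), tokens

def encode_binary(values, positive="yes"):
    """Binary scale: 0/1 encoding."""
    return np.array([1.0 if v == positive else 0.0 for v in values])

print(encode_real([1.232, 5.0, 2.7]))   # -> [0.  1.  0.39...]
print(encode_ordinal([1, 3, 2, 4]))     # -> [0.  0.66...  0.33...  1.]
```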

That’s not all, of course. Consider for instance domains like the financial industry. Here, all the data are marked by a highly relevant point of anisotropy regarding the scale: the zero. As soon as something becomes negative, it belongs to a different category, even if its numeric value is quite close to a positive one. It is such domain-specific issues that contribute to the large efforts people spend on the preparation of data. It is clear that any domain is structured by, and knows about, a lot of such “singular” points. People then claim that one has to be a specialist in the respective domain in order to be able to prepare the data.

Yet, that’s definitely not true, as we will see.

In order to understand the important point we have to understand a further feature of data in the context of empirical analysis. Remember that in empirical analysis we are primarily looking for a mapping function, which transforms values from measurement into values of a prediction or diagnosis, in short, into the values that describe the outcome. In medicine we may measure physiological data in order to arrive at a diagnosis, and doing so is almost identical to the way other people perform measurements in an organization.
Measured data can be described by means of a distribution. A distribution simply describes the relative frequency of certain values. Let us resort to the following two examples. They are simple frequency histograms, where each bin reflects the relative frequency of the values falling into the respective bin.

What is immediately striking is that both are far from the analytical distributions like the normal distribution. They are both strongly rugged, far from being smooth. What we can see also: they have more than one peak, even as it is not clear how many peaks there are.

Actually, in data analysis one meets such conditions quite often.

Figure 1a. A frequency distribution showing (at least) two modes.

Figure 1b. A sparsely filled frequency distribution

So, what to do with that?

First, the obvious anisotropy renders any trivial transformation meaningless. Instead, we have to focus precisely on those inhomogeneities. In a process perspective we may reason that the data measured by a single variable actually stem from at least two different processes, or that the process is non-stationary and switches between (at least two) different regimes. In either case, we split the variable into two, applying a criterion that is intrinsic to the data. This transformation is called deciling, and it is probably the third-most important transformation that could be applied to data.

Well, let us apply deciling to the data shown in Figure 1a.

Figure 2a,b: Distributions after deciling a variable V0 (as of Figure 1a) into V1 and V2. The improved resolution for the left part is not shown.

The result is three variables (the original V0 plus V1 and V2), each of which “expresses” some features. Since we can treat them (and the values they comprise) independently, we obviously constructed something. Yet, we did not construct a concept, we just introduced additional potential information. At that stage, we do not know whether this deciling will help to build a better model.
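A minimal sketch of such a split, assuming the intrinsic criterion is the deepest valley between the two highest peaks of the histogram (that particular criterion, the bin count, and the use of NaN for “not applicable” are assumptions of this sketch):

```python
import numpy as np

def split_at_valley(v0, bins=30):
    """Split a (bimodal) variable V0 into V1 and V2 at the emptiest bin
    lying between the two most strongly filled bins of its histogram."""
    v0 = np.asarray(v0, dtype=float)
    counts, edges = np.histogram(v0, bins=bins)
    p1, p2 = np.sort(np.argsort(counts)[-2:])          # the two highest peaks
    valley = p1 + int(np.argmin(counts[p1:p2 + 1]))    # deepest valley between them
    threshold = edges[valley]
    v1 = np.where(v0 < threshold, v0, np.nan)          # left regime
    v2 = np.where(v0 >= threshold, v0, np.nan)         # right regime
    return v1, v2, threshold
```

Applied to data like those of Figure 1a, this would yield two variables roughly corresponding to the two regimes; whether that actually helps can only be judged after the full modeling cycle.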

Variable V1 (Figure 2a, left part) can be transformed further, shifting its values to the right by applying a log-transformation. A log-transformation increases the differences between small values and decreases the differences between large values, and it does so in a continuous fashion. As a result, the peak of the distribution will move more to the right (and it will also be less prominent). Imagine a large collection of bank accounts, most of them holding amounts between 1’000 and 20’000, while a few hold millions. If we map all those values onto the same width, the small amounts can’t be distinguished any more, and we have to do that mapping, called linear normalization, with all our variables in order to make variances comparable. It is mandatory to transform such skewed distributions, with their mass concentrated at small values, into a new variable in order to access the potential information represented by them. Yet, as always in data analysis, before we have completed the whole modeling cycle down to validation we cannot know whether a particular transformation will have any, let alone a positive, effect on the power of our model.
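A hedged sketch of the two operations just mentioned, log-transformation and linear normalization; treating non-positive values as missing anticipates the next paragraph, and the concrete numbers are of course arbitrary:

```python
import numpy as np

def log_transform(v):
    """Spread small values apart, compress large ones; defined only for
    positive values, so non-positive entries become missing (NaN)."""
    v = np.asarray(v, dtype=float)
    return np.where(v > 0, np.log10(np.where(v > 0, v, 1.0)), np.nan)

def linear_normalize(v):
    """Map the (non-missing) values linearly onto [0..1] so that
    variances become comparable across variables."""
    lo, hi = np.nanmin(v), np.nanmax(v)
    return (v - lo) / (hi - lo)

accounts = np.array([1_000, 5_000, 20_000, 1_000_000], dtype=float)
print(linear_normalize(accounts))                  # small amounts collapse near 0
print(linear_normalize(log_transform(accounts)))   # small amounts stay distinguishable
```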

The log transformation has a further quite neat feature: it is defined only for positive values. Thus, if we apply a transformation that creates negative values for some of the observed values and subsequently apply a log-transform, we create missing values. In other words, we disregard some parts of the information that originally had been available in the data. So, a log-transform can be used to

  • – render items discernible in skewed distributions whose mass is concentrated at small values, and to
  • – deliberately mask parts of the information by a numeric transformation.

These two possible achievements make the log-transform one of the most frequently applied.

The most important transformation in predictive modeling is the construction of new variables by combining a small number (typically 2) of hitherto available ones, either analytically by some arithmetics or, more generally, by any suitable mapping from n variables to 1 variable, including the SOM. Yet, this will be discussed at a later point (in another chapter; for an overview see here). The trick is to find the most promising of such combinations of variables, because obviously the number of possible combinations is almost infinitely large.
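As a small sketch of what such a construction could look like (restricting the combinations to products and ratios of pairs is an assumption; any suitable mapping from n variables to 1 would qualify):

```python
import numpy as np
from itertools import combinations

def pairwise_candidates(data):
    """Derive candidate variables from all pairs of existing ones;
    `data` maps variable names to arrays already scaled to [0..1]."""
    eps = 1e-9                                      # guard against division by zero
    candidates = {}
    for a, b in combinations(sorted(data), 2):
        candidates[f"{a}*{b}"] = data[a] * data[b]
        candidates[f"{a}/{b}"] = data[a] / (data[b] + eps)
    return candidates
```

Even for pairs the number of candidates grows quadratically with the number of variables, which is why the search for promising combinations has to be organized, e.g. by some evolutionary strategy.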

Anyway, the transformed data will be subject to an associative mechanism, such as the SOM. Such mechanisms are based on the calculation of similarities and the comparison of similarity values. That is, the associative mechanism does not consider any of the tricky transformations, it just reflects the differences in the profiles (see here for a discussion of that).

Up to this point the conclusion is quite clear. Any kind of data preparation just has to improve the distinguishability of individual bits. Since we do not know anything in advance about the structure of the relationship between the measurement, the prediction, and the outcome we try to predict, there is nothing else we could do beforehand. Secondly, this means that there is no need to import any kind of semantics. Now remember that transforming data is an analytic activity, while it is the association of things that is a constructive activity.

There is a funny effect of this principle of discernibility. Imagine an initial model that comprises two variables v-a and v-b, among some others, for which we have found that the combination a*b provides a better model. In other words, the associative mechanism found a better representation for the mapping of the measurement to the outcome variable. Now first remember that all values for any kind of associative mechanism have to be scaled to the interval [0..1]. Multiplying two sets of such values introduces a salient change if both values are small or if both values are large. So far, so good. The funny thing is that the same degree of discernibility can be achieved by the transformative coupling v-a/v-b, the division. The change is orthogonal to that introduced by the multiplication, but that is not relevant for the comparison of profiles. This simple effect nicely explains a “psychological” phenomenon… actually, it is not psychological but rather an empirical one: one can invert the proposal about a relationship between any two variables without affecting the quality of the prediction. Obviously, it is not so much the transformative function as such that we have to consider as important. Quite likely, it is the form aspect of the warping of the data space qua transformation that we should focus on.

All of those transformation efforts exhibit two interesting phenomena. First, we apply them all as a hypothesis, which describes the relation between the data, the (more or less) analytic transformation, the associative mechanism, and the power of the model. If we can improve the power of the model by selecting just the suitable transformations, we also know which transformations are responsible for that improvement. In other words, we carried out a data experiment that, and that’s the second point to make here, revealed a structural hypothesis about the system we have measured. Structural hypotheses, however, could qualify as precursors of concepts and ideas. This switching forth and back between the space of hypotheses H and the space of models (or the learning map L, as Poggio et al. [1] call it) is precisely what constitutes such a data experiment.

Thus we end up with the insight that any kind of data preparation can be fully automated, which is quite contrary to the mainstream. For the mere possibility of machine-based episteme it is nevertheless mandatory. Fortunately, it is also achievable.

One (or two) last words on transformations. A transformation is nothing else than a method, and, importantly, vice versa. This means that any method is just a potential transformation. Secondly, transformations are by far, and I mean really by far, more important than the choice of the associative method. There is almost no (!) literature about transformations, while almost all publications are about the proclaimed features of some “new” method. Such method hell is dispensable. The chosen method just needs to be sufficiently robust, i.e. it should not—preferably: never—introduce a method-specific bias, or, alternatively, it should allow to control as many of its internal parameters as possible. Thus we chose the SOM. It is the most transparent and general method to associate data into groups for establishing the transition from extensions to intensions.

Besides the choice of the final model, the construction of a suitable set of transformations is certainly one of the main jobs in modeling.

Automating the Preparation of Data

How to automate the preparation of data? Fortunately, this question is relatively easy to answer: by machine-learning.

What we need is just a suitable representation of the problematics. In other words, we have to construct a set of features that together describe the properties of the data, especially its frequency distribution.

We have had good experience with applying curve fitting to the distribution in order to create a fingerprint that describes the properties of the values represented by a variable. For instance, a 5th-order polynomial, together with a negative exponential and a harmonic fit (trigonometric functions), are essential parts of such a fingerprint (don’t forget the first derivatives and the deviations from the fitted models). Further properties are the count and location of empty bins. The resulting vector typically comprises some 30 variables and thus contains enough information for learning the appropriate transformation.
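A minimal sketch of such a fingerprint; the exact feature set (here a 5th-order polynomial fit, a negative exponential fit, residuals, and empty-bin statistics; a harmonic fit would be added analogously) and the bin count are assumptions, chosen only to mirror the description above:

```python
import numpy as np
from scipy.optimize import curve_fit

def distribution_fingerprint(values, bins=30):
    """Describe the frequency distribution of one variable as a feature vector."""
    counts, edges = np.histogram(values, bins=bins, density=True)
    x = (edges[:-1] + edges[1:]) / 2.0                  # bin centers

    poly = np.polyfit(x, counts, 5)                     # 5th-order polynomial fit
    resid = counts - np.polyval(poly, x)                # deviation from that fit

    def neg_exp(x, a, b):                               # negative exponential fit
        return a * np.exp(-b * x)
    try:
        (a, b), _ = curve_fit(neg_exp, x, counts, p0=(counts.max(), 1.0), maxfev=5000)
    except RuntimeError:
        a, b = 0.0, 0.0

    empty = counts == 0                                 # empty-bin statistics
    first_empty = int(empty.argmax()) if empty.any() else -1
    return np.concatenate([poly, [resid.std()], [a, b],
                           [empty.sum(), first_empty]])
```

Such fingerprints of variables then serve themselves as input for a learning procedure that proposes the appropriate transformations.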

Conclusion

We have seen that the preparation of data can be automated. Only very few domain-specific rules need to be defined apriori, such as the anisotropy around zero in the financial domain. Yet, the important issue is that they indeed can be defined apriori, outside the modeling process, and fortunately, they are usually quite well-known.

The automation of the preparation of data is not an exotic issue. Our brain does it all the time. There is no necessity for an expert data-mining homunculus. Referring to the global scheme of targeted modeling (in the chapter about technical aspects), we now have completed the technical issues for this part. Since we already handled the part of associative storage, “only” two further issues remain on our track towards machine-based episteme: the emergence of ideas and concepts, and secondly, the glue between all of this.

From a wider perspective, we definitely experienced the relativity of data. It is not appropriate to conceive data as “givens”. Quite in contrast, they should be considered as subject to experimental re-combination, as a kind of invitation to transform them.

Data should not be conceived as the results of experiments or measurements, as some kind of immutable entities. Such beliefs are directly related to naive realism, to positivism, or to the tradition of logical empiricism. In contrast, data are the subject, or the substrate, of experiments of their own kind.

Once the purpose of modeling is given, the automation of modeling thus is possible. Yet, this “purpose” can at first be quite abstract, and usually it is something that results from social processes. It is a salient and open issue, not only for machine-based episteme, how to create, select or achieve a “purpose”.

Even as it still remains within the primacy of interpretation, it is not clear so far whether targeted modeling can contribute here. We guess, not so much, at least not on its own. What we obviously need is a concept for “ideas“.

  • [1] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee & Partha Niyogi (2004). General conditions for predictivity in learning theory. Nature 428: 419-422 (25 March 2004).

۞

Beyond Containing: Associative Storage and Memory

February 14, 2012 § Leave a comment

Memory, our memory, is a wonderful thing. Most of the time.

Yet, it also can trap you, sometimes terribly, if you use it in inappropriate ways.

Think about the problematics of being a witness. As long as you don’t try to remember exactly, you know precisely. As soon as you start to try to achieve perfect recall, everything starts to become fluid, first, then fuzzy and increasingly blurry. As if there were some kind of uncertainty principle, similar to Heisenberg’s [1]. There are other tricks, such as asking a person the same question over and over again: any degree of certainty, hence knowledge, will vanish. In the other direction, everybody knows the experience that a tiny little smell or sound triggers a whole story in memory, and often one that has not been thought of for a long time.

The main strengths of memory—extensibility, adaptivity, contextuality and flexibility—could also be considered its main weaknesses, if we expect perfect reproducibility for the results of “queries”. Yet, memory is not a database. There are neither symbols, nor indexes, and at the deeper levels of its mechanisms, also no signs. There is no particular neuron that would “contain” information in the way a file on a computer does.

Databases are, of course, extremely useful, precisely because they cannot do otherwise than reproduce answers perfectly. That’s how they are designed and constructed. And precisely for the same reason we may state that databases are dead entities, like crystals.

The reproducibility provided by databases expels time. We can write something into a database, stop everything, and continue precisely at the same point. Databases do not own their own time. Hence, they are purely physical entities. As a consequence, databases do not/can not think. They can’t bring or put things together, they do not associate, superpose, or mix. Everything is under the control of an external entity. A database does not learn when the amount of bits stored inside it increases. We also have to be very clear about the fact that a database does not interpret anything. All this should not be understood as a criticism, of course, these properties are intended by design.

The first important consequence of this is that any system relying just on the principles of a database will also inherit these properties. This raises the question about the necessary and sufficient conditions for the foundations of “storage” devices that allow for learning and informational adaptivity.

As a first step one could argue that artificial systems capable of learning, for instance self-organizing maps, or any other “learning algorithm”, may consist of a database and a processor. This would represent the bare bones of the classic von Neumann architecture.

The essence of this architecture is, again, reproducibility as a design intention. The processor is basically empty. As long as the database is not part of a self-referential arrangement, there won’t be something like a morphological change.

Learning without change of structure is not learning but only changing the values of structural parameters that have been defined apriori (at implementation time). The crucial step, however, would be to introduce those parameters in the first place. We will return to this point at a later stage of our discussion, when it comes to describing the processing capabilities of self-organizing maps.1

Of course, the boundaries are not well defined here. We may implement a system in a very abstract manner such that a change in the value of such highly abstract parameters indeed involves deep structural changes. In the end, almost everything can be expressed by some parameters and their values. That’s nothing else than the principle of the Deleuzean differential.

What we want to emphasize here is just the issue that (1) morphological changes are necessary in order to establish learning, and (2) these changes should be established in response to the environment (and the information flowing from there into the system). These two conditions together establish a third one, namely that (3) a historical contingency is established that acts as a constraint on the further potential changes and responses of the system. The system acquires individuality. Individuality and learning are co-extensive. Quite obviously, such a system is no longer a von Neumann device, even if it still runs on such a linear machine.

Our claim here is that learning requires a particular perspective on the concept of “data” and its “storage.” And, correspondingly, without this changed concept of the relation between data and storage, the emergence of machine-based episteme will not be achievable.

Let us just contrast the two ends of our space.

  • (1) At the logical end we have the von Neumann architecture, characterized by empty processors, perfect reproducibility on an atomic level, the “bit”; there is no morphological change; only estimation of predefined parameters can be achieved.
  • (2) The opposite end is made from historically contingent structures for perception, transformation and association, where the morphology changes due to the interaction with the perceived information2; we will observe emergence of individuality; morphological structures are always just relative to the experienced influences; learning occurs and is structural learning.

With regard to a system that is able to learn, one possible conclusion from that would be to drop the distinction between the storage of encoded information and the treatment of those encodings. Perhaps, it is the only viable conclusion to this end.

In the rest of this chapter we will demonstrate how the separation between data and their transformation can be overcome on the basis of self-organizing maps. Such a device we call “associative storage”. We also will find a particular relation between such an associative storage and modeling3. Notably, both tasks can be accomplished by self-organizing maps.

Prerequisites

When taking the perspective of usage, there is still another large contrasting difference between databases and associative storages (“memories”). In the case of a database, the purpose of a storage event is known at the time of performing the storing operation. In the case of memories and associative storage this purpose is not known, and often cannot reasonably be expected to be knowable in principle.

From that we can derive a quite important consequence. In order to build a memory, we have to avoid storing the items “as such,” as is the case for databases. We may call this the (naive) representational approach. Philosophically, the stored items do not have any structure inside the storage device, neither an inner structure nor an outer one. Any item appears as a primitive quale.

The contrast to the process in an associative storage is indeed a strong one. Here, it is simply forbidden to store items in an isolated manner, without relation to other items, as an engram, an encoded and reversibly decodable series of bits. Since a database works perfectly reversibly and reproducibly, we can encode the grapheme of a word into a series of bits and later decode that series back into a grapheme again, which in turn we as humans (with memory inside the skull) can interpret as a word. Strictly taken, we do NOT use the database to store words.

More concretely, what we have to do with the items comprises two independent steps:

  • (1) Items have to be stored as context.
  • (2) Items have to be stored as probabilized items.

The second part of our re-organized approach to storage is a consequence of the impossibility to know about future uses of a stored item. Taken inversely, using a database for storage always and strictly implies that the storage agent claims to know perfectly about future uses. It is precisely this implication that renders long-lasting storage projects so problematic, if not impossible.

In other words, and even more concise, we may say that in order to build a dynamic and extensible memory we have to store items in a particular form.

Memory is built on the basis of a population of probabilistic contexts in and by an associative structure.

The Two-Layer SOM

In a highly interesting prototypical model project (codename “WEBSOM”), Kaski (a collaborator of Kohonen) introduced a particular SOM architecture that serves the requirements described above [2]. Yet, Kohonen (and all of his colleagues alike) have so far not recognized the actual status of that architecture. We already mentioned this point in the chapter about some improvements of the SOM design: Kohonen fails to discern modeling from sorting when he uses the associative storage as a modeling device. Yet, modeling requires a purpose, operationalized into one or more target criteria. Hence, an associative storage device like the two-layer SOM can be conceived only as a pre-specific model.

Nevertheless, this SOM architecture is not only highly remarkable, but we also can easily extend it appropriately; thus it is indeed so important, at least as a starting point, that we describe it briefly here.

Context and Basic Idea

The context for which the two-layer SOM (TL-SOM) has been created is document retrieval by classification of texts. From the perspective of classification, texts are highly complex entities. This complexity derives from the following properties:

  • – there are different levels of context;
  • – there are rich organizational constraints, e.g. grammars
  • – there is a large corpus of words;
  • – there is a large number of relations that not only form a network, but which also change dynamically in the course of interpretation.

Taken together, these properties turn texts into ill-defined or even undefinable entities, for which it is not possible to provide a structural description, e.g. as a set of features, and particularly not in advance of the analysis. Briefly, texts are unstructured data. It is clear that especially non-contextual methods like the infamous n-grams are deeply inappropriate for the description, and hence also for the modeling, of texts. The peculiarity of texts has been recognized long before the age of computers. Around 1830, Friedrich Schleiermacher founded the discipline of hermeneutics as a response to the complexity of texts. In the last decades of the 20th century, it was Jacques Derrida who brought in a new perspective on it. In Deleuzean terms, texts are always and inevitably deterritorialized to a significant portion. Kaski & coworkers addressed only a modest part of these vast problematics: the classification of texts.

Their starting point was to preserve context. The large variety of contexts makes it impossible to take any kind of raw data directly as input for the SOM. That means that the contexts had to be encoded in a proper manner. The trick is to use a SOM for this encoding (details in the next section below). This SOM represents the first layer. The subject of this SOM are the contexts of words (definition below). The “state” of this first SOM is then used to create the input for the SOM on the second layer, which then addresses the texts. In this way, the size of the input vectors is standardized and reduced.

Elements of a Two-Layer SOM

The elements, or building blocks, of a TL-SOM devised for the classification of texts are

  • (1) random contexts,
  • (2) the map of categories (word classes)
  • (3) the map of texts

The Random Context

A random context encodes the context of any of the words in a text. Let us assume, for the sake of simplicity, that the context is bilaterally symmetric according to 2n+1, i.e. for example with n=3 the length of the context is 7, where the focused word (“structure”) is at position 3 (when counting starts with 0).

Let us resort to the following example, which takes just two snippets from this text. The numbers represent some arbitrary enumeration of the relative positions of the words.

sequence A of words: “… without change of structure is not learning …”
rel. positions in text: 53  54  55  56  57  58  59

sequence B of words: “… not have any structure inside the storage …”
rel. positions in text: 19  20  21  22  23  24  25

The position numbers we just need for calculating the positional distance between words. The interesting word here is “structure”.

For the next step you have to think about the words listed in a catalog of indexes, that is as a set whose order is arbitrary but fixed. In this way, any of the words gets its unique numerical fingerprint.

Index   Word        Random Vector
…       …
1264    structure   0.270  0.938  0.417  0.299  0.991 …
1265    learning    0.330  0.990  0.827  0.828  0.445 …
1266    Alabama     0.375  0.725  0.435  0.025  0.915 …
1267    without     0.422  0.072  0.282  0.157  0.155 …
1268    storage     0.237  0.345  0.023  0.777  0.569 …
1269    not         0.706  0.881  0.603  0.673  0.473 …
1270    change      0.170  0.247  0.734  0.383  0.905 …
1271    have        0.735  0.472  0.661  0.539  0.275 …
1272    inside      0.230  0.772  0.973  0.242  0.224 …
1273    any         0.509  0.445  0.531  0.216  0.105 …
1274    of          0.834  0.502  0.481  0.971  0.711 …
1275    is          0.935  0.967  0.549  0.572  0.001 …
…

Any of the words of a text can now be replaced by an apriori determined vector of random values from [0..1]; the dimensionality of those random vectors should be around  80 in order to approximate orthogonality among all those vectors. Just to be clear: these random vectors are taken from a fixed codebook, a catalog as sketched above, where each word is assigned to exactly one such vector.

Once we have performed this replacement, we can calculate the averaged vectors per relative position of the context. In the case of the example above, we would calculate the reference vector for position -3 as the average of the vectors encoding the words “without” and “not”.

Let us be more explicit. Example sequence A we first translate into the positional numbers, interpret these positional numbers as column headers, and fill each column with the values of the respective word’s fingerprint. For the 7 positions (-3 … +3) we get 7 columns:

sequence A of words:                “… without  change  of     structure  is     not    learning …”
rel. positions in text:                 53      54      55     56         57     58     59
grouped around “structure”:             -3      -2      -1     0          1      2      3
random fingerprints per position:       0.422   0.170   0.834  0.270      0.935  0.706  0.330
                                        0.072   0.247   0.502  0.938      0.967  0.881  0.990
                                        0.282   0.734   0.481  0.417      0.549  0.603  0.827
                                        …further entries of the fingerprints…

The same we have to do for the second sequence B. Now we have two tables of fingerprints, both comprising 7 columns and N rows, where N is the length of the fingerprint. From these two tables we calculate the average values and put them into a new table (which is of course also of dimensions 7×N). Thus, the example above yields 7 such averaged reference vectors. If we have a dimensionality of 80 for the random vectors, we end up with a matrix of [r,c] = [80,7].
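A hedged sketch of this construction, collapsing the two steps (a fixed random codebook per word, and per-position averaging across all occurrences of the focus word); the names and the seeded generator are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 80                                     # dimensionality of the random fingerprints
codebook = {}                                # word -> fixed random vector from [0..1]

def fingerprint(word):
    if word not in codebook:
        codebook[word] = rng.random(DIM)
    return codebook[word]

def random_context(snippets, focus, n=3):
    """Average the positional fingerprints over all 2n+1 contexts of `focus`
    across the given word sequences; the result is a DIM x (2n+1) matrix."""
    acc, hits = np.zeros((DIM, 2 * n + 1)), 0
    for words in snippets:
        for i, w in enumerate(words):
            if w == focus and n <= i < len(words) - n:
                window = words[i - n:i + n + 1]
                acc += np.column_stack([fingerprint(t) for t in window])
                hits += 1
    return acc / max(hits, 1)

snippets = [["without", "change", "of", "structure", "is", "not", "learning"],
            ["not", "have", "any", "structure", "inside", "the", "storage"]]
ctx = random_context(snippets, "structure")   # matrix of shape (80, 7)
```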

In a final step we concatenate the columns into a single vector, yielding a vector of 7×80 = 560 variables. This might appear as a large vector. Yet, it is much smaller than the whole corpus of words in a text. Additionally, such vectors can be compressed by the technique of random projection (mathematical foundations by [3], first proposed for data analysis by [4], utilized for SOMs later by [5] and [6]), which today is quite popular in data analysis. Random projection works by matrix multiplication. Our vector (1R × 560C) gets multiplied with a matrix M(r) of 560R × 100C, yielding a vector of 1R × 100C. The matrix M(r) also consists of flat random values. This technique is very interesting, because no relevant information is lost, while the vector gets shortened considerably. Of course, in an absolute sense there is a loss of information. Yet, the SOM only needs the information which is important to distinguish the observations.
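A correspondingly small sketch of the concatenation and the random projection; note that in the literature on random projection the matrix entries are usually drawn from a zero-mean distribution, while the text above speaks of flat random values, which the sketch follows:

```python
import numpy as np

rng = np.random.default_rng(1)
ctx = rng.random((80, 7))            # stands in for the averaged context matrix above
vec = ctx.flatten(order="F")         # concatenate the 7 columns -> 7 x 80 = 560 values
M_r = rng.random((vec.size, 100))    # projection matrix of flat random values, 560 x 100
projected = vec @ M_r                # compressed context vector of length 100
```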

This technique of transferring a sequence made from items encoded on a symbolic level into a vector based on random contexts can of course be applied to any symbolic sequence.

For instance, it would be a drastic case of reductionism to conceive of the path taken by humans in an urban environment just as a sequence of locations. Humans are symbolic beings, and the urban environment is full of symbols to which we respond. Yet, for the population-oriented perspective any individual path is just a possible path; naturally, we interpret it as a random path. The path taken through a city needs to be described both by location and by symbol.

The advantage of the SOM is that the random vectors that encode the symbolic aspect can be combined seamlessly with any other kind of information, e.g. the locational coordinates. That’s the property of multi-modality. Which particular combination of “properties” is then suitable to classify the paths for a given question is subject to “standard” extended modeling as described in the chapter Technical Aspects of Modeling.

The Map of Categories (Word Classes)

From these random context vectors we can now build a SOM. Similar contexts will arrange in adjacent regions.

A particular text now can be described by its differential abundance across that SOM. Remember that we have sent the random contexts of many texts (or text snippets) to the SOM. To achieve such a description, a (relative) frequency histogram is calculated, which has as many classes as the SOM has nodes. The values of the histogram are the relative frequencies (“probabilities”) of the particular text’s contexts, in comparison to all other texts.
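A hedged sketch of this description; it assumes a trained first-layer SOM given as a plain codebook array and a brute-force best-matching-unit search, both of which are stand-ins for whatever implementation is actually used:

```python
import numpy as np

def bmu(codebook, x):
    """Index of the best matching unit for a context vector x (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def text_fingerprint(context_vectors, codebook):
    """Relative frequency histogram of first-layer BMU hits for one text;
    this histogram serves as the input vector for the second-layer SOM."""
    hits = np.zeros(len(codebook))
    for x in context_vectors:
        hits[bmu(codebook, x)] += 1.0
    return hits / max(hits.sum(), 1.0)
```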

Any particular text is now described by a fingerprint, that contains highly relevant information about

  • – the context of all words as a probability measure;
  • – the relative topological density of similar contextual embeddings;
  • – the particularity of texts across all contextual descriptions, again as a probability measure;

Those fingerprints represent texts and they are ready-mades for the final step, “learning” the classes by the SOM on the second layer in order to identify groups of “similar” texts.

It is clear, that this basic variant of a Two-Layer SOM procedure can be improved in multiple ways. Yet, the idea should be clear. Some of those improvements are

  • – to use a fully developed concept of context, e.g. this one, instead of a constant length context and a context without inner structure;
  • – evaluating not just the histogram as a foundation of the fingerprint of a text, but also the sequence of nodes according to the sequence of contexts; that sequence can be processed using a Markov-process method, such as HMM, Conditional Random Fields, or, in a self-similar approach, by applying the method of random contexts to the sequence of nodes;
  • – reflecting at least parts of the “syntactical” structure of the text, such as sentences, paragraphs, and sections, as well as the grammatical role of words;
  • – enriching the information about “words” by representing them not only in their observed form, but also by their close synonyms, or enriched with pointers to semantically related words, as can be taken from labeled corpora.

We want to briefly return to the first layer. Just imagine not measuring the histogram, but instead following the indices of the contexts across the developed map with your fingertips. A particular path, or virtual movement, appears. I think that it is crucial to reflect this virtual movement in the input data for the second layer.

The reward could be significant, indeed. It offers nothing less than a model for conceptual slippage, a term which has been emphasized by Douglas Hofstadter throughout his research on analogical and creative thinking. Note that in our modified TL-SOM this capacity is not an “extra function” that had to be programmed. It is deeply built “into” the system; in other words, it makes up its character. Besides Hofstadter’s proposal, which is based on a completely different approach and aimed at a different task, we do not know of any other system that would be capable of that. We even may expect that the efficient production of metaphors can be achieved by it, which is not an insignificant goal, since all practiced language is always metaphoric.

Associative Storage

We already mentioned that the method of TL-SOM extracts important pieces of information about a text and represents it as a probabilistic measure. The SOM does not contain the whole piece of text as single entity, or a series of otherwise unconnected entities, the words. The SOM breaks the text up into overlapping pieces, or better, into overlapping probabilistic descriptions of such pieces.

It would be a serious misunderstanding to perceive this splitting into pieces as a drawback or failure. It is the mandatory prerequisite for building an associative storage.

Any further target-oriented modeling would refer to the two layers of a TL-SOM, but never to the raw input text. Thus it can work reasonably fast for a whole range of different tasks. One of those tasks that can be solved by a combination of associative storage and true (targeted) modeling is to find an optimized model for a given text, or any text snippet, including the identification of the discriminating features. We also can turn the perspective around, addressing a query to the SOM about an alternative formulation in a given context…

From Associative Storage towards Memory

Despite its power and its potential as associative storage, the Two-Layer SOM still can’t be conceived as a memory device. The associative storage just takes the probabilistically described contexts and sorts them topologically into the map. In order to establish “memory”, further components are required that provide the goal orientation.

Within the world of self-organizing maps, simple (!) memories are easy to establish. We just have to combine a SOM that acts as associative storage with a SOM for targeted modeling. The peculiar distinctive feature of that second SOM for modeling is that it does not work on external data, but on “data” as it is available in and as the SOM that acts as associative storage.

We may establish a vivid memory in its full meaning if we establish the following further components: (1) targeted modeling via the SOM principle, (2) a repository of the targeted models that have been built from (or using) the associative storage, and (3) at least a partial operationalization of a self-reflective mechanism, i.e. a modeling process that is going to model the working of the TL-SOM. Since in our framework the basic SOM module is able to grow and to differentiate, there is no principal limitation for such a system any more, concerning its capability to build concepts, models, and (logical) habits for navigating between them. Later, we will call the “space” where this navigation takes place the “choreosteme“: drawing figures into the open space of epistemic conditionability.

From such a memory we may expect dramatic progress concerning the “intelligence” of machines. The only questionable thing is whether we should still call such an entity a machine. I guess, there is neither a word nor a concept for it.


Notes

1. Self-organizing maps have some amazing properties on the level of their interpretation, which they share especially with Markov models. As such, the SOM and the Markov model are outstanding: both can be conceived as devices that turn programming statements, i.e. all the IF-THEN-ELSE statements occurring in a program, into DATA. Even logic itself, or more precisely, any quasi-logic, is getting transformed into data. SOM and Markov models are double-articulated (a Deleuzean notion) into logic on the one side and the empiric on the other.

In order to achieve this, full write access to the extensional as well as the intensional layer of a model is necessary. Hence, neither artificial neural networks nor, of course, statistical methods like PCA can be used to achieve the same effect.

2. It is quite important not to forget that (in our framework) information is nothing that “is out there.” If we follow the primacy of interpretation, for which there are good reasons, we also have to acknowledge that information is not a substantial entity that could be stored or processed. Information is nothing else than the actual characteristics of the process of interpretation. These characteristics can’t be detached from the underlying process, because this process is represented by the whole system.

3. Keep in mind that we only can talk about modeling in a reasonable manner if there is an operationalization of the purpose, i.e. if we perform target oriented modeling.

  • [1] Werner Heisenberg. Uncertainty Principle.
  • [2] Samuel Kaski, Timo Honkela, Krista Lagus, Teuvo Kohonen (1998). WEBSOM – Self-organizing maps of document collections. Neurocomputing 21 (1998) 101-117.
  • [3] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189–206. Amer. Math. Soc., 1984.
  • [4] R. Hecht-Nielsen. Context vectors: general purpose approximate meaning representations self-organized from raw data. In J.M. Zurada, R.J. Marks II, and C.J. Robinson, editors, Computational Intelligence: Imitating Life, pages 43–56. IEEE Press, 1994.
  • [5] Papadimitriou, C. H., Raghavan, P., Tamaki, H., & Vempala, S. (1998). Latent semantic indexing: A probabilistic analysis. Proceedings of the Seventeenth ACM Symposium on the Principles of Database Systems (pp. 159-168). ACM press.
  • [6] Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: Applications to image and text data. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 245-250). ACM Press.

۞

SOM = Speedily Organizing Map

February 12, 2012 § Leave a comment

The Self-organizing Map is a powerful and high-potential computational procedure.

Yet, there is no free lunch, especially not for procedures that are able to deliver meaningful results.

The self-organizing map is such a valuable procedure; we have discussed its theoretical potential with regard to a range of different aspects in other chapters. Here, we do not want to deal further with such theoretical or even philosophical issues, e.g. related to the philosophy of mind; instead we focus on the issue of performance, understood simply in terms of speed.

For all those demo SOMs the algorithmic time complexity is not really an issue. The algorithm approximates rather quickly to a stable state. Yet, small maps—where “small” means something like less than 500 nodes or so—are not really interesting. It is much like in brains. Brains are made largely from neurons and some chemicals, and a lot of them can do amazing things. If you take 500 of them you may stuff a worm in an appropriate manner, but not even a termite. The important questions, beyond the nice story about theoretical benefits, thus are:

What happens with the SOM principle if we connect 1’000’000 nodes?

How to organize 100, 1000 or 10’000 of such million-nodes SOMs?

By these figures we would end up with somewhere around 1..10 billion nodes1, all organized along the same principles. Just to avoid a common misunderstanding here: these masses of neurons are organized in a very similar manner, yet the totality of them builds a complex system as we have described it in our chapter about complexity. There are several, if not many, emergent levels, and a lot of self-referentiality. These 1 billion nodes are not all engaged with segmenting external data! We will see elsewhere, in the chapter about associative storage and memory, how such a deeply integrated modular system could be conceived of. There are some steps to take, though not terribly complicated or difficult ones.

When approaching such scales, the advantage of self-organization turns palpably into a problematic disadvantage. “Self-organizing” means “bottom-up,” and this bottom-up direction in SOMs is represented by the fact that all records representing the observations have to be compared repeatedly to all nodes in the SOM in order to find the so-called “best matching unit” (BMU). The BMU is just that node in the network whose intensional profile is the most similar among all the other profiles2. Though the SOM avoids comparing all records to all records, its algorithmic complexity scales as a power function with respect to its own scale! Normally, algorithms depend on the size of the data, but not on their own “power.”

In its naive form the SOM shows a complexity of something like O(n · w · m²), where n is the amount of data (number of records, size of the feature set), w the number of nodes to be visited for searching the BMU, and m² the number of nodes affected in the update procedure. w and m are scaled by factors f1, f2 < 1, but the basic complexity remains. The update procedure affects an area that is dependent on the size of the SOM, hence the exponent. The exact degree of algorithmic complexity is not absolutely determined, since it depends on the dynamics of the learning function, among other things.

The situation worsens significantly if we apply improvements to the original flavor of the SOM, e.g.

  • – the principle of homogenized variance (per variable across extensional containers),
  • – in targeted modeling, tracking the explicit classification performance per node on the level of records, which means that the data has to be referenced
  • – size balancing of nodes,
  • – morphological differentiation like growing and melting, as in the FluidSOM, which additionally allows for free ranging nodes,
  • – evolutionary feature selection and creating proto-hypothesis,
  • – dynamic transformation of data,
  • – then think about the problem of indeterminacy of empiric data, which enforces differential modeling, i.e. a style of modeling that is performing experimental investigation of the dependency of the results on the settings (the free parameters) of the computational procedure: sampling the data, choosing the target, selecting a resolution for the scale of the resulting classification, choosing a risk attitude, among several more.

All of this affects the results of modeling, that is, the prognostic/diagnostic/semantic conclusions one could draw from it. Albeit all these steps could be organized based on a set of rules, including applying another instance of a SOM, and thus could be run automatically, all of these necessary explorations require separate modeling. It is quite easy to set up an exploration plan for differential modeling that would require several dozens of models, and if evolutionary optimization is going to be applied, hundreds if not thousands of different maps have to be calculated.

Fortunately, the SOM offers a range of opportunities for using dynamic look-up tables and parallel processing. A SOM consisting of 1’000’000 neurons could easily utilize several thousand threads, without much worry about concurrency (or the collision of parallel threads). Unfortunately, such computers are not available yet, but you got the message…

Meanwhile we have to apply optimization through dynamically built look-up tables.  These I will describe briefly in the following sections.

Searching the Most Similar Node

An ingredient of speeding up the SOM in real-life applications is an appropriate initialization of the intensional profiles across the map. Of course, precisely this cannot be known in advance, at least not exactly. The self-organization of the map is the shortest path to its final state; there is no possibility for an analytic shortcut. Kohonen proposes to apply Principal Component Analysis (PCA) for calculating the initial values. I am convinced that this is not a good idea. The PCA is deeply incompatible with the SOM, hence it will respond to very different traits in the data. PCA and SOM behave similarly only in the case of demo data…

Preselecting the Landing Zone

A better alternative is the SOM itself. Since the mapping organized by the SOM preserves the topology of the data, we can apply a much smaller SOM, or even a nested series of down-scaled SOMs, to create a coarse model for selecting the appropriate sub-population in the large SOM. The steps are the following:

  • 1. create the main SOM, say 40’000 nodes, organized as a square grid with sides of 200 nodes;
  • 2. create a SOM for preselecting the landing zone, scaled to approximately 14 by 14 nodes, and use the same structure (i.e. the same feature vectors) as for the large SOM;
  • 3. prime the small SOM with a small but significant sample of the data, comprising say 2000..4000 records for its roughly 200 nodes; draw this sample randomly from the data; this step should complete comparatively quickly (faster by a factor of around 200 in our example);
  • 4. initialize the main SOM by a blurred (geometric) projection of the intensional profiles from the minor to the larger SOM;
  • 5. now use the minor SOM as a model for the selection of the landing zone, simply by means of geometric projection (see the sketch below).

As a result, the number of nodes to be visited in the large SOM in order to find the best match remains almost constant.
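A hedged sketch of this two-stage search; the grid sizes follow the example above, while the search radius, the brute-force BMU routine and all names are assumptions of the sketch:

```python
import numpy as np

def nearest(codebook, x):
    """Brute-force BMU: index of the most similar intensional profile."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def landing_zone_bmu(x, small_cb, small_shape, large_cb, large_shape, radius=8):
    """Locate a coarse position on the small SOM, project it geometrically
    onto the large SOM, and scan only that vicinity for the best match."""
    si = nearest(small_cb, x)
    sr, sc = divmod(si, small_shape[1])                    # coarse grid position
    lr = int(sr * large_shape[0] / small_shape[0])         # geometric projection
    lc = int(sc * large_shape[1] / small_shape[1])
    best, best_d = -1, np.inf
    for r in range(max(0, lr - radius), min(large_shape[0], lr + radius + 1)):
        for c in range(max(0, lc - radius), min(large_shape[1], lc + radius + 1)):
            i = r * large_shape[1] + c
            d = np.linalg.norm(large_cb[i] - x)
            if d < best_d:
                best, best_d = i, d
    return best
```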
There is an interesting corollary to this technique. If one needs a series of SOM-based representations of the data, distinguished just by the maximum number of nodes in the map, one should always start with the lowest, i.e. the most coarse resolution, the one with the fewest nodes. The results then can be used as a projective priming of the SOM on the next level of resolution.

Explicit Lookup Tables linking Observations to SOM areas

In the previous section we described the usage of a secondary, much smaller SOM as a device for limiting the number of nodes to be scanned. The same problematics can be addressed by explicit lookup tables that establish a link between a given record and a vicinity around its last (few) best matching units.

If the SOM is approximately stable, that is, after the SOM has seen a significant portion of the data, it is not necessary any more to check the whole map. Just scan the vicinity around the last best matching node in the map. Again, the number of nodes necessary to be checked is almost constant.
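A hedged sketch of such a lookup table; the vicinity radius and the fallback to a full scan for unseen records are assumptions:

```python
import numpy as np

last_bmu = {}                                   # record id -> last best matching unit

def bmu_with_lookup(rec_id, x, codebook, shape, radius=6):
    """After the map has become approximately stable, scan only the vicinity
    of the record's last BMU; unseen records still get one full scan."""
    if rec_id not in last_bmu:
        idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    else:
        lr, lc = divmod(last_bmu[rec_id], shape[1])
        rows = range(max(0, lr - radius), min(shape[0], lr + radius + 1))
        cols = range(max(0, lc - radius), min(shape[1], lc + radius + 1))
        cand = [r * shape[1] + c for r in rows for c in cols]
        idx = cand[int(np.argmin(np.linalg.norm(codebook[cand] - x, axis=1)))]
    last_bmu[rec_id] = idx
    return idx
```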

The stability cannot be predicted in advance, of course. The SOM is, as the name says, self-organizing (albeit in a weak manner). As a rule of thumb, one could check the average number of observations attached to a particular node, the average being taken across all nodes that contain at least one record. This average filling should be larger than 8..10 (due to considerations based on variance, and some arguments derived from non-parametric statistics… but it is a rule of thumb).
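The rule of thumb itself is trivial to compute; a minimal sketch:

```python
import numpy as np

def average_filling(node_counts):
    """Average number of records per node, taken over non-empty nodes only;
    values above roughly 8..10 suggest the map is stable enough for vicinity search."""
    counts = np.asarray(node_counts)
    nonempty = counts[counts > 0]
    return float(nonempty.mean()) if nonempty.size else 0.0
```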

Large Feature Vectors

Feature vectors can be very large. In life sciences and medicine I experienced cases with 3000 raw variables. During data transformation this number can increase to 4000..6000 variables. The comparison of such feature vectors is quite expensive.

Fortunately, there are some nice tricks, which are all based on the same strategy. This strategy comprises the following steps.

  • 1. create a temporary SOM with the following, very different feature vector; this vector has just around 80..100 (real-valued) positions and 1 position for the index variable (in other words, the table key); thus the size of the vector is a 60th of the original vector, if we are faced with 6000 variables.
  • 2. create the derived vectors by encoding the records representing the observations with a technique called “random projection”; such a projection is generated by multiplying the data vector with a token from a catalog of (labeled) matrices that are filled with uniform random numbers ([0..1]).
  • 3. create the “random projection” SOM based on these transformed records
  • 4. after training, replace the random projection data with real data, re-calculate the intensional profiles accordingly, and run a small sample of the data through the SOM for final tuning.

The technique of random projection goes back to the 1980s. The principle works because of two facts:

  • (1) Quite amazingly, all random vectors beyond a certain dimensionality (80..200, as said before) are nearly orthogonal to each other. The random projection compresses the original data without losing the bits of information that are distinctive, even if they are not accessible in an explicit manner any more.
  • (2) The only trait of the data that is considered by the SOM is their potential difference.

Bags of SOMs

Finally, one can take advantage of splitting the data into several smaller samples. These samples require only smaller SOMs, which run much faster (we are faced with a power law). After training each of the SOMs, they can be combined into a compound model.

This technique is known as bagging in Data Mining. Today it is also quite popular in the form of so-called random forests, where instead of one large decision tree many smaller ones are built and then combined. This technique is very promising, since it is a technique of nature. It is simply modularization on an abstract level, leading to the next level of abstraction in a seamless manner. It is also one of our favorite principles for the overall “epistemic machine”.
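A hedged sketch of such a bag of SOMs; `train_som` is a hypothetical stand-in for whatever SOM implementation is actually used, and the sample and node counts are arbitrary:

```python
import numpy as np

def train_som(sample, n_nodes, rng=np.random.default_rng(0)):
    """Hypothetical stand-in: a real implementation would train a SOM on the
    sample and return its codebook (the intensional profiles of its nodes)."""
    return rng.random((n_nodes, sample.shape[1]))     # placeholder codebook

def bagged_soms(data, n_bags=10, n_nodes=2000, rng=np.random.default_rng(1)):
    """Split the data into random samples, train one smaller SOM per sample,
    and keep the resulting maps together as a compound model."""
    idx = rng.permutation(len(data))
    return [train_som(data[b], n_nodes) for b in np.array_split(idx, n_bags)]
```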

Notes

1. This would represent just around 10% of the neurons of our brain, if we interpret each node as a neuron. Yet, this comparison is misleading. The functionality of a node in a SOM rather represents a whole population of neurons, although there is no 1:1 principle transferable between them. Hence, such a system would be roughly of the size of a human brain, and much more important, it is likely organized in a comparable, albeit rather alien, manner.

2. Quite often, the vectors that are attached to the nodes are called weight vectors. This is a serious misnomer, as neither the nodes are weighted by this vector (alone), nor the variables that make up that vector (for more details see here). Conceptually it is much more sound to call those vectors “intensional profiles.” Actually, one could indeed imagine  a weight vector that would control the contribution (“weight”) of variables to the similarity / distance between two of such vectors. Such weight vectors could even be specific for each of the nodes.


۞

Similarity

December 30, 2011 § 1 Comment

Similarity appears to be a notoriously inflationary concept.

Already in 1979, a presumably even incomplete catalog of similarity measures in information retrieval listed almost 70 ways to determine similarity [1]. In contemporary philosophy, however, it is almost absent as a concept, probably because it is considered merely a minor technical aspect of empiric activities. Often it is also related to naive realism, which claimed a similarity between a physical reality and concepts. Similarity is also a central topic in cognitive psychology, yet not often discussed, probably for the same reasons as in philosophy.

In both disciplines, understanding is usually equated with drawing conclusions. Since the business of drawing conclusions, and of describing its kinds and surrounds, is considered to be the subject of logic (as a discipline), it is comprehensible that logic has been rated by many practitioners and theoreticians alike as the master discipline. While there has been a vivid discourse about logical aspects for many centuries now, the role of similarity is largely neglected, and where vagueness makes its way to the surface, it is “analyzed” completely within logic. Also not quite surprisingly, artificial intelligence focused strongly on a direct link towards propositional logic and predicate calculus for a long period of time. This link has been represented by the programming language “Prolog,” an abbreviation standing for “programming in logic.” It was established in the first half of the 1970s, notably by the Marseille and Edinburgh groups. Let us just note that this branch of machine learning failed disastrously. Quite remarkably, the generally attested reason for this failure has been called the “knowledge acquisition bottleneck” by the community suffering from it. Somehow the logical approach was completely unsuitable for getting in touch with the world, which is actually not really surprising for anyone who understood Wittgenstein’s philosophical work, even if only partially. Today, the logic-oriented approach is generally avoided in machine learning.

As a technical aspect, similarity is abundant in the so-called field of data mining. Yet, even there it is not discussed as a subject in its own right. In this field, as represented by the respective software tools, rather primitive notions of similarity are employed, importing a lot of questionable assumptions. We will discuss them a bit later.

There is a particular problematics with the concept of similarity that endangers many other abstract terms, too. This problematics appears if the concept is equated with its operationalization. The sciences and engineering are particularly prone to the failure of not being aware of this distinction. It is an inevitable consequence of the self-conception of science, particularly of the hypothetico-deductive approach [cf. 2], to assign ontological weight to concepts. Nevertheless, such assignment always commits a naturalization fallacy. Additionally, we may suggest that ontology itself is a deep consequence of an overly scientific, say positivistic, mechanic, etc. world view. Dropping the positivistic stance removes ontology as a relevant attitude.

As a consequence, science is not able to reflect about the concept itself. What science can do, regardless of the discipline, is just to propose further variants as representatives of a hypothesis, or to classify the various proposed approaches. This poses a serious secondary methodological problematics, since it equally holds that there is no science without the transparent usage of the concept of similarity. Science should control the free parameters of experiments and their valuation. Somewhat surprisingly, almost the “opposite” can be observed. The fault is introduced by statistics, as we will see, and this result really came as a surprise even for me.

A special case is provided by “analytical” linguistics, where we can observe a serious case of reduction. In [3], the author selects the title “Vagueness and Linguistics,” but also admits that “In this paper I focused my discussion on relative adjectives.” Well, vagueness can hardly be restricted to anything like relative adjectives, even in linguistics. Even more astonishing is the fact that similarity does not appear as a subject at all in the cited article (except in a reference to another author).

In the field engaged in the theory of the metaphor [cf. 4, or 5], one can find a lot of references to similarity. In any case known to me it is, however, regarded as something “elementary” and unproblematic. Obviously neither extra-linguistic modeling nor any kind of inner structure of similarity is recognized as important or even as possible. No particular transparent discourse about similarity and modeling is available from this field.

From these observations it is possible in principle to derive two different, and mutually exclusive, conclusions. First, we could conclude that similarity is irrelevant for understanding phenomena like language understanding or the empirical constitution. We don’t believe in that. Second, it could be that similarity represents a blind spot across several communities. Therefore we will try to provide a brief overview about some basic topics regarding the concept of similarity.

Etymology

Let us add some etymological considerations for a first impression. Words like “similar,” “simulation” or “same” all derive from the proto-Indo-European (PIE) base “*sem-/*som-“, which meant “together, one”, yielding in Old English “same.” The notion of “simulacrum” also belongs to the same cloud of words; the simulacrum is a central issue in the earliest pieces of philosophy of which we know (Plato) in sufficient detail.

The German word “ähnlich,” being the direct translation of “similar,” derives from Old High German (althochdeutsch, well before ~1050 AD) “anagilith” [6], a composite of an- and gilith, together meaning something like “angleichen,” for which in English we find the words adapt, align, adjust or approximate, but also “conform to” or “blend.” The similarity to “sema” (“sign”) seems to be only superficial; it is believed that sema derives from PIE “dhya” [7].

If some items are said to be “similar,” it is meant that they are not “identical,” where identical means indistinguishable. To make them (virtually) indistinguishable, they would have to be transformed. Even from etymology we can see that similarity requires an activity before it can be attested or assigned. Similarity is nothing to be found “out there”; instead it is something that one produces in a purposeful manner. This constructivist aspect is quite important for our following considerations.

Common Usage of Similarity

In this section, we will inspect the usage of the concept of “similarity” in some areas of particular relevance. We will visit cognitive psychology, information theory, data mining and statistical modeling.

Cognitive Psychology

Let us start with the terminology that has been developed in cognitive psychology, where one can find a rich distinction of the concept of similarity. It started with the work of Tversky [8], while Goldstone provides a useful overview more recently [9].

Tversky, a highly innovative researcher on cognition, tried to generalize the concept of similarity. His intention is to overcome the typical weakness of “geometric models”, which “[…] represent objects as points in some coordinate space such that the observed dissimilarities between objects correspond to the metric distances between the respective points.”1 The major assumption (and drawback) of geometric models is the (metric) representability in coordinate space. A typical representative of “geometric models” as Tversky calls them employs the nowadays widespread Euclidean distance as an operationalization for similarity.

A new set-theoretical approach to similarity is developed in which objects are represented as collections of features, and similarity is described as a feature-matching process. Specifically, a set of qualitative assumptions is shown to imply the contrast model, which expresses the similarity between objects as a linear combination of the measures of their common and distinctive features.

Tversky’s critique of the “geometrical approach” applies only if two restrictions are in place: (1) if one disregards missing values, which actually is the case in most practices; and (2) if the dimensional interpretation is considered stable and unchangeable, i.e. no folding or warping of the data space via transformation of the measured data is applied.

Yet, it is neither necessary to disregard missing values in a feature-based approach nor to dismiss dimensional warping. Here Tversky does not differentiate between the form of representation and the actual rule for establishing the similarity relation. This conflation is quite abundant in many statements about similarity and its operationalization.

What Tversky effectively proposed is now known as binning. His approach is based on features, though in a way quite different (at first sight) from our proposal, as we will show below. Yet, it is not the values of the features that are compared, but the two sets at the level of the items, by means of a particular ratio function. Seen from a different perspective, the data scale used for assessing similarity is reduced to the ordinal or even the nominal scale. Tversky’s approach thus is prone to destroy information present in the “raw” signal.

An attribute (Tversky’s “feature”) that occurs in different grades or shades is translated into a small set of different, distinct and mutually exclusive features. Tversky obviously does not recognize that binning is just one out of many, many possible ways to deal with observational data, i.e. to transform it. Applying a particular transformation based on some theory in a top-down manner is equivalent to the claim that the selected transformation builds a perfect filter for the actually given data. Of course, this claim is deeply inadequate (see the chapter about technical aspects of modeling). Any untested, axiomatically imposed algorithmic filter may destroy just those pieces of information that would have been vital to achieve a satisfying model. One simply can’t know beforehand.

Tversky’s approach builds on feature sets. The difference between those sets (on a nominal level) is supposed to represent the similarity and is expressed by the following formula:

s(a,b) = F(A ∩ B, A − B, B − A)     (eq. 1)

which Tversky describes in the following way:

The similarity of a to b is expressed as a function F of three arguments: A ∩ B, the features that are common to both a and b; A-B, the features that belong to a but not to b; B-A, the features that belong to b but not to a.

This formula also reflects what he calls “contrast.” (It is similar to Jaccard’s distance, being, so to speak, an extended practical version of it.) Yet Tversky, like any other member of the community of cognitive psychologists referring to this or a similar formula, did not recognize that the features, when treated in this way, are all equally weighted. This is a consequence of sticking to set theory. Again, it is just the fallback position of the investigator’s initial ignorance. In the real world, however, features are differentially weighted, building a context. In the chapter about the formalization of the concept of context we propose a more adequate way to think about feature sets, though our concept of context shares important aspects with Tversky’s approach.

Tversky emphasizes that his concept does not consist of just one single instance or formula. He introduces weighting factors for the terms of eq. 1, which then leads to families of similarity functions. To our knowledge this is the only instance (besides ours) arguing for a manifold regarding similarity. Yet, again, Tversky still does not draw the conclusion that the chosen instance of a similarity “functional” (see below) has to be conceived as just a hypothesis.
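
To make the point about families of similarity functions concrete, here is a minimal sketch of the contrast model of eq. 1 in Python. It is our own illustration, not code from Tversky’s article: the salience function f is taken simply as set cardinality, and the weights are hypothetical defaults. The only point it demonstrates is that each choice of (θ, α, β) already amounts to a different similarity hypothesis.

```python
# Minimal sketch (our illustration) of the contrast model in eq. 1,
# with the salience function f taken simply as set cardinality.
def tversky_contrast(features_a: set, features_b: set,
                     theta: float = 1.0, alpha: float = 0.5, beta: float = 0.5) -> float:
    """s(a,b) = theta*f(A & B) - alpha*f(A - B) - beta*f(B - A)."""
    common = len(features_a & features_b)
    only_a = len(features_a - features_b)
    only_b = len(features_b - features_a)
    return theta * common - alpha * only_a - beta * only_b

# Every choice of (theta, alpha, beta) is already a different similarity hypothesis.
print(tversky_contrast({"red", "round", "small"}, {"red", "round", "large"}))  # 1.0
```

Note also that all features enter this function with equal weight, which is exactly the limitation criticized above.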

In cognitive psychology (even today), the term “feature-based models” of similarity does not refer to feature vectors as they are used in data mining, or even generalized vectors of assignates, as we proposed it in our concept of the generalized model. In Tversky’s article this becomes manifest on p.330. Contemporary psychologists like Goldstone [9] distinguish four different ways of operationalizing similarity: (1) geometric, (2) feature-based, (3) alignment-based, and (4) transformational similarity. For Tversky [8] and Goldstone, the label “geometric model” refers to models based on feature vectors, as they are used in data mining, e.g. as Euclidean distance.

Our impression is that cognitive psychologists fail to think in a sufficiently abstract manner about features and similarity. Additionally, there seems to be a tendency towards the representationalist fallacy. Features are only recognized as features as far as they appear “attached” to the object for human senses. Dropping this attitude, it becomes an easy exercise to subsume all those four types under a feature-vector approach that (1) allows for missing values and assigns them a “cost,” and that (2) is not limited to primitive distance functions like the Euclidean or Hamming distance. The underdeveloped generality is especially visible concerning the alignment-based or transformational subtypes of similarity.

A further gap in the similarity theory in cognitive psychology is the missing separation between the operation of comparison and the operationalization of similarity as a projective transformation into a 0-dimensional space, that is a scalar (a single value). This distinction is vital, in our opinion, to understand the characteristics of comparison. If one does not separate similarity from comparison, it becomes impossible to become aware of higher forms of comparison.

Information theory

A much more profound generalization of similarity, at least at first sight, has been proposed by Dekang Lin [10], which is based on an “information-theoretic definition of similarity that is applicable as long as there is a probabilistic model.” The main advantage of this approach is its wide applicability, even in cases where only coarse frequency data are available. Quite unfortunately, Lin’s proposal neglects a lot of information if there are accurate measurements in the form of feature-vectors. Besides the comparison of strings and statistical frequency distributions, Lin’s approach is applicable to sets of features, but not to profile-based data, as we propose for our generalized model.

Data Mining

Data mining is a distinct set of tools and methods that are employed in a well-organized manner in order to facilitate the extraction of relevant patterns [11], either for predictive or for diagnostic purposes. Data mining (DM) is often conceived as a part of so-called “knowledge discovery,” yielding the famous abbreviation KDD: knowledge discovery in databases [11]. In our opinion, the term “data mining” is highly misleading, and “knowledge discovery” even deceptive. In contrast to earthly mining, in the case of information the valuable objects are not “out there” like minerals or gems, and knowledge can’t be “discovered” like diamonds or physical laws. Even the “retrieval” of information is impossible in principle. To think otherwise dismisses the need for interpretation and hence contradicts widely acknowledged positions in contemporary epistemology. One has to know that the terms “KDD” and “data mining” are shallow marketing terms, coined to release the dollars of naive customers. Yet, KDD and DM are myths many people believe in, and they are reproduced in countless publications. As concepts, they simply remain utter nonsense. As a non-sensical practice that is deeply informed by positivism, it is harmful for society. It is more appropriate to call the respective activity, more down-to-earth, just diagnostic or predictive modeling (which actually is equivalent).

Any observation of entities takes place along apriori selected properties, often physical ones. This selection of properties is part of the process of creating an operationalization, which actually means making a concept operable by making it measurable. Actually, those properties are not “natural properties of objects.” Quite to the contrary, objecthood is created by the assignment of a set of features. This inversion is often overlooked in data mining projects, and consequently also the eminently constructive character of data-based modeling. Hence, it is likewise not correct to call it “data analysis”: an analysis does not add anything. Predictive/diagnostic models are constructed and synthesized like small machines. Models may well be conceived as an informational machinery. To make our point clear: nobody among the large community of machine-building engineers would support the view that any machine comes into existence just through “analysis.”

Given the importance of similarity in comparison, it is striking to see that in many books about data mining the notion of “similarity” does not appear even a single time [e.g. 12], and in many more publications only in a very superficial manner. Usually, it is believed that the Euclidean distance is a sound, sufficient and appropriate operationalization of similarity. Given its abundance, we have to take a closer look at this concept, how it works, and how it fails.

Euclidean Distance and its Failure

We already met the idea that objects are represented along a set of selected features. In the chapter about comparison we saw that in order to compare items of a population of objects, those objects are to be compared on the basis of a selected and shared feature set. Next, it is clear that for each of the features some values can be measured. For instance, presence could be indicated by the dual pair of values 1/0. For nominal values like names, re-scaling mechanisms have been proposed [13]. Thus, any observation can be transformed into a table of values, where the (horizontal) rows represent the objects and the columns describe the features.

We also can say that any of the objects contained in such a table is represented by a profile. Note that the order of the columns (features) is arbitrary, but it is also constant for all of the objects covered by the table.

The idea now is that each of the columns represent a dimension in a Cartesian, orthogonal coordinate system. As a preparatory step, we normalize the data, i.e. for each single column the values contained in it are scaled such that the ratios remain unchanged, but the absolute values are projected into the interval [0..1].

By means of such a representation any of the objects (=data rows) can be conceived as a particular point in the space spanned by the coordinate system. The similarity S then is operationalized as the “inverse” of the distance, S=1-d, between any of the points. The distance can be calculated according to the Euclidean formula for the length of the hypotenuse in the orthogonal triangle (2d case). In this way, the points are understood as the endpoint of a vector that starts in the origin of the coordinate system. Thus, this space is often called “data space” or “vector space.” The distance is called “Euclidean distance.”

Since all of the vectors are within the unit sphere (any value is in [0..1]), there is another possibility for an operationalization of the similarity. Instead of the distance one could take the angle between any two of those vectors. This yields the so-called cosine-measure of (dis-)similarity.
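
The procedure just described can be stated compactly. The following sketch, assuming numpy and simple column-wise min-max normalization, shows both operationalizations mentioned above: the distance-derived similarity S = 1 − d and the cosine measure. The rescaling of the Euclidean distance by √k is our own illustrative choice to keep S inside [0..1]; it is not part of the standard recipe.

```python
import numpy as np

def normalize_columns(table: np.ndarray) -> np.ndarray:
    """Column-wise min-max scaling: ratios are kept, values land in [0..1]."""
    mins, maxs = table.min(axis=0), table.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero
    return (table - mins) / span

def euclidean_similarity(p: np.ndarray, q: np.ndarray) -> float:
    # S = 1 - d; here d is rescaled by sqrt(k), the maximal possible distance
    # inside the unit (hyper-)cube, so that S stays within [0..1].
    d = np.linalg.norm(p - q) / np.sqrt(len(p))
    return 1.0 - d

def cosine_similarity(p: np.ndarray, q: np.ndarray) -> float:
    # The angle-based alternative: cosine of the angle between the two vectors.
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))
```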

Besides the fact that missing values are often (and wrongly) excluded from a feature-vector-based comparison, this whole procedure has a serious built-in flaw, whether as cosine- or as Euclidean distance.

The figure 1a below shows the profiles of two objects above a set of assignates (aka attributes, features, properties, fields). The embedding coordinate space has k dimensions. One can see that the black profile (representing object/case A) and the red profile (representing object/case B) are fairly similar. Note that within the method of the Euclidean distance all ai are supposed to be independent of each other.

Figure 1a: Two objects A’ and B’ have been represented as profiles A and B across a shared feature vector ai of size k.

Next, we introduce a third profile, representing object C. Suppose that the correlation between profiles A and C is almost perfect. This means that the inner structure of objects A and C could be considered very similar. Some additional factor might just have damped the signal, such that all values are proportionally lower by an almost constant ratio when compared to the values measured from object A.

Figure 1b: Compared to figure 1a, a third object C’ is introduced as a profile C; this profile causes a conflict about the order that should be induced by the similarity measure. There are (very) good reasons, from systems theory as well as from information theory, to consider A and C more similar to each other than either A-B or B-C. Nevertheless, employing Euclidean distance will lead to a different result, rating the pairing A-B as the most similar one.

The particular difficulty now is given by the fact that it depends on considerations completely outside of the chosen operationalization of similarity which pairing of observations is considered more similar. Yet, this dependency inverts the intended arrangement of the empiric setup: the determination of the similarity actually should be used to decide about those outside considerations. Given the Euclidean distance, A and B are clearly much more similar to each other than either A-C or B-C. Using a correlative measure instead would select A-C as the most similar pairing. This effect gets more and more serious the more assignates are used to compare the items.
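
A purely hypothetical numeric illustration in the spirit of figures 1a/1b (the values are invented, not taken from any figure) makes the rank inversion visible: C is a damped, almost perfectly correlated copy of A, while B merely runs close to A in absolute terms.

```python
import numpy as np

A = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.3])
B = np.array([0.8, 0.4, 0.6, 0.5, 0.7, 0.2])
C = 0.5 * A + 0.02                        # same "inner form" as A, damped signal

euclid = lambda p, q: np.linalg.norm(p - q)
corr   = lambda p, q: np.corrcoef(p, q)[0, 1]

print(euclid(A, B), euclid(A, C))   # ~0.30 vs ~0.70: Euclidean picks A-B
print(corr(A, B), corr(A, C))       # ~0.90 vs 1.00: correlation picks A-C
```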

Now imagine that there are many observations, dozens, hundreds or hundreds of thousands, that serve as a basis for deriving an intensional description of all observations. It is quite obvious that the final conclusions will differ drastically depending on the selection of the similarity measure. The choice of the similarity measure is thus by no means only of technical interest. The particular problematics, but also, as we will see, the particular opportunity that is related to the operationalization of similarity consists in the fact that there is a quite short and rather strong link between a technical aspect and the semantic effect.

Yet, there are measures that reflect the similarity of the form of the whole set of items more appropriately, such as least-squares distances, or measures based on correlation, like the Mahalanobis distance. However, these analytic measures have the disadvantage of relying on certain global parametric assumptions, such as a normal distribution. And they do not completely resolve the situation shown in figure 1b, not even in theory.

We just mentioned that the coherence of value items may be regarded as a form. Thus, it is quite natural to use a similarity measure that is derived from geometry or topology, which also does not suffer from any particular analytic apriori assumption. One such measure is the Hausdorff metric, or, more generally, the Gromov-Hausdorff metric. Having been developed in geometry, these measures find their “natural” application in image analysis, such as partial matching of patterns to larger images (aka finding “objects” in images). For the comparison of profiles we have to interpret them as figures in a 2-dimensional space, with |ai|/2 coordinate points. Two such figures are then prepared to be compared. The Hausdorff distance is also quite interesting because it allows comparing whole sets of observations: not only two paired observations (profiles) interpreted as coordinates in ℝ2, but also three observations as ℝ3, or a whole set of n observations, arranged as a table, as a point cloud in ℝn. Assuming compactness, i.e. a macroscopic world without gaps, we may interpret them also as curves. This allows comparing whole sub-sets of observations at once, which is a quite attractive feature for the analysis of relational data. As far as we know, nobody has ever used the Hausdorff metric in this way.
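
A sketch of this idea, assuming scipy’s directed Hausdorff routine: a profile is read pairwise as |ai|/2 coordinate points in ℝ2 (this pairwise embedding follows the convention suggested above; handling of an odd trailing value is our own simplification), and two such “figures” are compared via the symmetric Hausdorff distance.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def profile_as_figure(profile: np.ndarray) -> np.ndarray:
    # Interpret the profile pairwise as points in R^2; an odd trailing value
    # is simply dropped in this sketch.
    k = len(profile) - (len(profile) % 2)
    return profile[:k].reshape(-1, 2)

def hausdorff_distance(p: np.ndarray, q: np.ndarray) -> float:
    # Symmetric Hausdorff distance = max of the two directed distances.
    P, Q = profile_as_figure(p), profile_as_figure(q)
    return max(directed_hausdorff(P, Q)[0], directed_hausdorff(Q, P)[0])
```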

Epistemologically, it is interesting that a topologically inspired assessment of data provides a formal link between feature-based observations and image processing. Maybe, this is relevant for the subjective impression to think in “images,” though nobody has ever been able to “draw” such an image… This way, the idea of “form” in thought could acquire a significant meaning.

Yet, already in his article published more than 30 years ago, Tversky [8] mentioned that the metric approach is barely convincing. He writes (p.329)

The applicability of the dimensional assumption is limited, […] minimality is somewhat problematic, symmetry is apparently false, and the triangle inequality is hardly compelling.

It is of utmost importance to understand that the selection of the similarity measure, as well as the selection of the features used to calculate it, are by far the most important factors in the determination, or better: hypothetical presupposition, of the similarity between the profiles (objects) to be compared. The similarity measure and the feature selection are by far more important than the selection of a particular method, i.e. a particular way of organizing the application of the similarity measure. Saying “more important” also means that the differences in the results are much larger between different similarity measures than between methods. From a methodological point of view it is thus quite important that the similarity measure is “accessible” and not buried in a “heap of formulas.”

Similarity measures that are based only on dimensional interpretation and coordinate spaces are not able to represent issues of form and differential relations, what is also (and better) known as “correlation.” Of course, other approaches different from correlation that would reflect the form aspect of the internal relations of a set of variables (features) would do the job, too. We just want to emphasize that the assumption of perfect independence among the variables is “silly” in the sense that it contradicts the “game” that the modeler actually pretends to play. This leads more often than not to irrelevant results. The serious aspect about this is, however, given by the fact that this deficiency remains invisible when comparing results between different models built according to the Euclidean dogma.

There is only one single feasible conclusion from this: Similarity can’t be regarded as property of actual pairings of objects. The similarity measure is a free parameter in modeling, that is, nothing else than a hypothesis, though on the structural level. As a hypothesis, however, it needs to be tested for adequacy.

Similarity in Statistical Modeling

In statistical modeling the situation is even worse. Usually, the subject of statistical modeling is not the individual object or its representation. The reasoning in statistical modeling differs strongly from the reasoning in predictive modeling. Statistics compares populations, or at least groups as estimates of populations. Depending on the scale of the data, the amount of data, the characteristics of the data and the intended argument, a specialized method has to be picked from a large variety of potential methods. Often, the selected method also has to be parameterized. As a result, the whole process of creating a statistical model is more “art” than science. Results of statistical “analysis” are only approximately reproducible across analysts. It is indeed a kind of irony that at the heart of quantitative science one finds a non-scientific methodological core.

Anyway, our concern is similarity. In statistical modeling there is no similarity function visible at all. All that one can see is the result, and proposals like “population B is not part of population A, with a probability of 3% of being a false positive.” Yet, the final argument that populations can be discerned (or can’t) is obviously also an argument about the probability of a correct assignment of the members of the compared populations. Hence, it is also clearly an argument about the group-wise as well as the individual similarity of the objects. The really bad thing is that the similarity function is barely visible at all. Often it is some kind of simple difference between values. The main point is that it is not possible to parametrize the hidden similarity function, except by choosing the alpha level for the test. It is “distributed” across the whole procedure of the respective method. In its most important aspect, any of the statistical methods has to be regarded as a black box.
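
A small sketch with invented data, assuming scipy, may illustrate the point: in a two-sample t-test the hidden “similarity function” is essentially a standardized difference of group means, buried inside the procedure, and the only knob exposed to the analyst is the alpha level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)   # invented measurements
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # similarity judgment, hidden inside
alpha = 0.05                                         # the single accessible parameter
print(t_stat, p_value, p_value < alpha)
```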

These problems with statistical modeling are prevalent across the general frameworks, i.e. whether one chooses a frequentist or a Bayesian attitude. Recently, Alan Hájek [14] showed that statistics is a framework that in all its flavors suffers from the reference class problem. Cheng [15] correctly notes about the reference class problem that

“At its core, it observes that statistical inferences depend critically on how people, events, or things are classified. As there is (purportedly) no principle for privileging certain categories over others, statistics become manipulable, undermining the very objectivity and certainty that make statistical evidence valuable and attractive …”

So we can see that the reference class problem is just a corollary of the fact that the similarity function is not given explicitly and hence is not accessible. Thus, Cheng seeks unfulfillable salvation by invoking the cause of the defect itself: statistics. He writes

I propose a practical solution to the reference class problem by drawing on model selection theory from the statistics literature.

Although he is right in pointing to the necessity of model selection, he fails to recognize that statistics can’t be helpful in this task. We find it interesting that this author (Cheng) has been writing for the community of law theoreticians. This sheds a bright light onto the relevance of an appropriate theory of modeling.

As a consequence we conclude that statistical methods should not be used as the main tool for any diagnostic/predictive modeling of real-world data. The role of statistical methods in predictive/diagnostic modeling is just the same as that of any other transformation: they are biased filters whose adequacy has to be tested, nothing less, and, above all, definitely nothing more. Statistics should be used only within completely controllable, hence completely closed environments, such as simulations, or “data experiments.”

The Generalized View

Before we start we would like to recall the almost trivial point that the concept of similarity makes sense exclusively in the context of diagnostic/predictive modeling, where “modeling” refers to the generalized model, which in turn is part of a transcendental structure.

After having briefly discussed the relation of the concept of similarity to some major domains of research, we now may turn to the construction/description of a proper concept of similarity. The generalized view that we are going to argue for should help in determining the appropriate mode of speaking about similarity.

Identity

Identity is often seen as the counterpart of similarity, or also as some kind of simple asymptotic limit to it. Yet, these two concepts are so deeply incommensurable that they cannot be related at all.

One could suggest that identity is a relation that indicates a particular result of a comparison, namely indistinguishability. We then also could say that under any possible transformation applied to identical items, the respective items remain indistinguishable. Yet, if we compare two items we already refer to the concept of similarity, from which we wanted to distinguish identity. Thus it is clear that identity and similarity are structurally different. There is no way from one to the other.

In other words, the language game of identity excludes any possibility for a comparison. We can set it only by definition, axiomatically. This means that not only can the concepts not be related to each other; additionally, we see that the subjects of the two concepts are categorically different. Identity is only meaningful as an axiomatically introduced equality of symbols.

In still other words we could say that identity is restricted to axiomatically defined symbols in formal surrounds, while similarity is applicable only in empirical contexts. Similarity is not about symbols, but about measurement and the objects constructed from it.

This has profound consequences.

First, identity can’t be regarded as a kind of limit to which similarity would asymptotically approximate. For any two objects that have been rated as being “equal,” notably through some sort of comparison, it is thus possible to find a perspective under which they are not equal any more.

Second, it is impossible to take an existential stance towards similarity. Similarity is the result of an action, of a method or technique that is embedded in a community. Hence it is not possible to assign similarity an ontic dimension. Similarity is not part of any possible ontology.

We can’t ask “What is similarity?”, we also can not even pretend to determine “the” similarity of two subjects. “Similarity” is a very particular language game, much like its close relatives like vagueness. We only can ask “How to speak about similarity?”

Third, it is now clear that there is no easy way from a probabilistic description to a propositional reference. We already introduced this in another chapter, and we will deal with it dedicatedly elsewhere. There is no such transition within a single methodology. We just see again how far Wittgenstein’s conclusion about the relation of the world and logic reaches. The importance of the categorical separation between identity and similarity, or between the empiric and the logic, can hardly be overestimated. For our endeavor of a machine-based epistemology it is of vital interest to find a sound theory for this transition, which in any of the relevant research areas has not even been recognized as a problematic subject so far.

Practical Aspects

Above we have seen that any particular similarity measure should be conceived as part of a general hypothesis about the best way to create an optimized model. Within such a hypothesizing setting we can distinguish two domains embedded into a general notion of similarity. We could align these two modes with the distinction Peirce introduced with regard to uncertainty: probability and verisimilitude [16]. Thus, the first domain regards the variation along assignates that are shared among two items. From the perspective of any of the compared items, there is complete information about the extension of the world of that item. Any matter is a matter of degree and probability, as Peirce understood it. Taking the perspective of Deleuze, we could also call it possibility.

The second domain is quite different. It is concerned with the difference of the structure of the world as it is accessible for each of the compared items, where this difference is represented by a partial non-matching of the assignates that provide the space for measurement. Here we meet Tversky’s differential ratio that builds upon differences in the set of assignates (“features,” as he called it) and can be used also to express differential commonality.

Yet, the two domains are not separated from each other in an absolute manner. The logarithm is a rather simple analytic function with some unique properties. For instance, it is not defined for argument values in [-∞..0]. The zero (“0”), however, can in turn be taken to serve as a double articulation that allows expressing two very different things: (1) through linear normalization, the lowest value of the range, and (2) the (still symbolic) absence of a property. Using the logarithm, the value “0” then gets transformed into a missing value, because the logarithm is not defined for arg=0; that is, we turn the symbolic into a quasi-physical absence. The extremely (!) valuable consequence of this is that by means of the logarithmic transformation we can change the feature vector on the fly in a context-dependent manner, where “context” (i) can denote any relation between variables or the values therein, and (ii) may be related to certain segments of observations. Even the (extensional) items within an (intensional) class or empirical category may be described by dynamically regulated sets of assignates (features). In other words, the logarithmic transformation provides a plain way towards abstraction. Classes as clusters then do not just comprise items homogenized by an identical feature set. Hence, it is a very powerful means in predictive modeling.
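
A minimal sketch of this “double articulation” of zero, using numpy and an invented profile: after normalization a value of 0 marks the lower end of the range, but the logarithm turns exactly those zeros into undefined entries, which are then treated as missing. The effective feature set of the observation thereby changes on the fly, depending on its values.

```python
import numpy as np

profile = np.array([0.00, 0.35, 0.80, 0.00, 0.15])   # normalized, invented values

with np.errstate(divide="ignore"):
    log_profile = np.log(profile)                    # log(0) -> -inf
log_profile[np.isinf(log_profile)] = np.nan          # former zeros become "absent"

effective_features = np.flatnonzero(~np.isnan(log_profile))
print(log_profile)            # approx. [ nan -1.05 -0.22  nan -1.90]
print(effective_features)     # the assignates that remain "present" for this item
```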

Given the two domains in the practical aspects of similarity measures, it now becomes clearer that we indeed need to insist on a separation of the assignates and the mapping similarity function, as we did in the chapter about comparison. We reproduce Figure 2b from that chapter:

Figure 2: Schematic representation of the comparison of two items. Items are compared along sets of “attributes,” which have to be assigned to the items, indicated by the symbols {a} and {b}.

The sets of assignates symbolized as {a} or {b} for items A, B don’t comprise just the “observable” raw “properties.” Of course, all those properties are selected and assigned by the observer, which results in the fact that the observed items are literally established only through this measurement step. Additionally, the assigned attributes, or better “assignates,” comprise also all transformations of the raw, primary assignates, building two extended sets of assignates. The similarity function then imposes a rule for calculating the scalar (a single value) that finally serves as a representation of the respective operationalization. This function may represent any kind of mapping between the extended sets of assignates.

Such a mapping (function) could consist of a compound of weighted partial functions, according to the proposals of Tversky or Cheng, and a particular profile mapping. The sets {a} and {b} need not be equal, of course. One could even apply the concept of formalized contexts instead of a set of equally weighted items. Nevertheless, there remains the apriori of the selection of the assignates, which precedes the calculation of the scalar. In practical modeling this selection will almost surely lead to a removal of most of the measured “features.”

Above we said that any similarity measure must be considered as a free parameter in modeling, that is, as nothing else than a hypothesis. For the sake of abstraction and formalization this requires that we generalize the single similarity function into a family of functions, which we call “functional.” In category theoretic terms we could call it also a “functor.” The functor of all similarity functions then would be part of the functor representing the generalized model.

Formal Aspects

In the chapter about the category of models we argued that models can not be conceived in a set theoretic framework. Instead, we propose to describe models and the relations among them on the level of categories, namely the category of functors. In plain words, models are transformative relations, or in a term from category theory, arrows. Similarity is a dominant part of those transformative relations.

Against this background (or: within this framing), we could say that similarity is a property of the arrow, while a particular similarity function represents a particular transformation. By expressing similarity as a value we effectively map these properties of the arrow onto a scalar, which could be a tangible value or an abstract scalar. Even more condensed, we could say that in a general perspective:

Similarity can be conceived as a mapping of relations onto a scalar.

This scalar should not be misunderstood as the value of the categorical “arrow.” Arrows in category theory are not vectors in a coordinate system. The assessment of similarity thus can’t be taken just as kind of a simple arithmetic transformation. As we already said above from a different perspective, similarity is not a property of objects.

Since similarity makes sense only in the context of comparing, hence in the context of modeling, we also can recognize that the value of this scalar is dependent on the purpose and its operationalization, the target variable. Similarity is nothing that could be measured. It is entirely the result of an intention.

Similarity and the Symbolic

It is more appropriate to understand similarity as the actualization of a potential. Since the formal result of this actualization is a scalar, i.e. a primitive with only a simple structure, this actualization also prepares the ground for the possibility of a new symbolization. The similarity scalar is able to take three quite different roles. First, it can act as a criterion to impose a differential semi-order under ceteris paribus conditions for modeling. Actual modeling may be strongly dominated by arbitrary, but nevertheless stable habits. Second, the similarity scalar also could be taken as an ultimate “abbreviation” of a complex activity. Third, and finally, the scalar may well appear as a quasi-material entity due to the fact that there is so little inner structure to it.

It is the “Similarity-Game” that serves as a ground for hatching symbols.

Imagine playing this game according to the Euclidean rules. Nobody could expect rich or interesting results, of course. The same holds if “similarity” is misunderstood as technical issue, which could be represented or determined as a closed formalism.2

It is clear that these results are also quite important for understanding the working of metaphors in practiced language. Actually, we think that there is no other mode of speaking in “natural,” i.e. practiced languages than the metaphorical mode. The understanding of similarity as a ground for hatching symbols directly leads to the conclusion that words and arrangements of words do not “represent” or “refer to” something. Even more concisely, we may say that neither things nor signs or symbols are able to carry references. Everything is created in the mind. Yet, still refuting radical constructivism, we suggest that the tools for this creative work are all taken from the public.

Conclusions

As usual, we finally address the question about the relevance of our achieved results for the topic of machine-based epistemology.

From a technical perspective, the most salient insight is probably the relativity of similarity. This relativity renders similarity into a strictly non-ontological concept (we anyway think that the idea of a “pure” ontology is based on a great misunderstanding). Despite the fact that it is pretended thousands of times each day that “the” similarity has been calculated, such a “calculation” is not possible. The reason for this is simply that (1) it is not just a calculation as, for instance, the calculation of the Easter date, and (2) there is nothing like “the” similarity that could be calculated.

In any implementation that provides means for the comparison of items we have to care for an appropriate generality. Similarity should never be implemented as a formula, but instead as a (templated, or abstract) object. Another (not only) technical aspect concerns the increasing importance of the “form factor” when comparing profiles: the more assignates are used to compare the items, the more important it becomes. This should be respected in any implementation of a similarity measure by increasing the weight of such “correlational” aspects.
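
A minimal sketch of this recommendation, assuming Python and invented class names: similarity is implemented not as a fixed formula but as an abstract, exchangeable object, so that the concrete measure remains a testable hypothesis of the model rather than a hidden constant.

```python
from abc import ABC, abstractmethod
import numpy as np

class SimilarityMeasure(ABC):
    """Similarity as an exchangeable object, i.e. a structural hypothesis."""
    @abstractmethod
    def __call__(self, p: np.ndarray, q: np.ndarray) -> float: ...

class EuclideanSimilarity(SimilarityMeasure):
    def __call__(self, p, q):
        return 1.0 - np.linalg.norm(p - q) / np.sqrt(len(p))

class CorrelationSimilarity(SimilarityMeasure):
    def __call__(self, p, q):
        return float(np.corrcoef(p, q)[0, 1])

def rank_neighbors(item: np.ndarray, pool: np.ndarray,
                   measure: SimilarityMeasure) -> np.ndarray:
    """Order the pool by decreasing similarity; the measure is a free parameter."""
    scores = np.array([measure(item, row) for row in pool])
    return np.argsort(-scores)
```

Swapping the measure object amounts to testing a different hypothesis about similarity, without touching the rest of the modeling machinery.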

From a philosophical perspective there are several interesting issues to mention. It should be clear that our notion of similarity does not follow the realist account. Our notion of similarity is not directed towards the relation of “objects” in the “physical world” and “concepts” in the “mental world.” Please excuse the inflationary usage of quotation marks, yet it is not possible to repel realism in such sentences otherwise. Indeed, we think that similarity can’t be applied to concepts at all. Trying to do so [e.g. 17], one would commit a double categorical mistake: First, concepts may arise exclusively as an embedment (not: entailment) of symbols, which in turn require similarity as an operational field. It is impossible to apply similarity to concepts without further interpretation. Second, concepts can’t be positively determined and they are best conceived as transcendental choreostemic poles. This categorically excludes the application of the concept of similarity to the concept of concepts. A naturalization by means of (artificial) neuronal structures [e.g. 18] misses the point even more dramatically.3 “Concept” and “similarity” are mutually co-extensive, “similarly” to space and time.

As always, we think that there is the primacy of interpretation, hence it is useless to talk about a “physical world as it is as-such.” We do not deny that there is a physical outside, of course. Through extensive individual modeling that is not only shared by a large community, but also has to provide some anticipatory utility, we even may achieve insights, i.e. derive concepts, that one could call “similar” with regard to a world. But again, this would require a position outside of the world AND outside of the used concepts and practiced language. Such a position is not available. “Objects” do not “exist” prior to interpretation. “Objecthood” derives from (abstract) substance by adding a lot of particular, often “structural,” assignates within the process of modeling, most saliently by imposing a purpose and the respective instance of the concept of similarity.

There are two conclusions from that. First, similarity is a purely operational concept, it does not imply any kind of relation to an “existent” reference. Second, it would be wrong to limit similarity (and modeling, anticipation, comparison, etc.) to external entities like material “objects.” With the exception of pure logic, we always have to interpret. We interpret by using a word in thought, we interpret even if our thoughts are shapeless. Thinking is an open, processual system of cascaded modeling relations. Modeling starts with the material interaction between material aspects of “objects” or bodies, it takes place throughout the perception of external differences, the transduction and translation of internal signals, the establishment of intensions and concepts in our associative networks, up to the ideas of inference and propositional content.

Our investigations ended by describing similarity as a scalar. This dimensionless appearance should not be misunderstood in a representationalist manner, that is, as an indication that similarity does not have a structure. Our analysis revealed important structural aspects that relate to many areas in philosophy.

In other chapters we have seen that modeling and comparing are inevitable actions. Due to their transcendental character we even may say that they are inevitable events. As subjects, we can’t evade the necessity of modeling. We can do it in a diagnostic attitude, directed backward in time, or we can do it in a predictive or anticipatory attitude, directed forward in time. Both directions are connected through learning and bridged by Peirce’s sign situation, but any kind of starting point reduces to modeling. If spelled out in a sudden manner it may sound strange that modeling and comparing are deeply inscribed into the event-structure of the world. Yet, there are good reasons to think so.

Against this background, similarity denotes a hot spot for the actualization of intentions. As an element in modeling it is the operation that transports purposes into the world and its perception. Even more concentrated, we may call similarity the carrier of purpose. For all other elements of modeling besides the purpose and similarity one can refer to “necessities,” such as material constraints, or limitations regarding time and energy. (Obviously, contingent selections are nothing one can speak about in other ways than just by naming them; they are singularities.)

Saying this, it is clear that the relative neglect of similarity in favor of logic should be corrected. Similarity is the (abstract) hatching-ground for symbols, so to say, in Platonic terms, the sky for the ideas.

Notes

1. The geometrical approach is largely equal to what today is known as the feature vector approach, which is part of any dimensional mapping. Examples are multi-dimensional scaling, principal component analysis, or self-organizing maps.

2. Category theory provides a formalism that is not closed, since categories can be defined in terms of category theory. This self-referentiality is unique among formal approaches. Examples for closed formalisms are group theory, functional analysis or calculi like λ-calculus.

3. Besides that, Christoph Gauker provided further arguments that concepts cannot be conceived as regions of similarity spaces [19].

  • [1] McGill, M., Koll, M., and Noreault, T. (1979). An evaluation of factors affecting document ranking by information retrieval systems. Final report for grant NSF-IST-78-10454 to the National Science Foundation, Syracuse University.
  • [2] Wesley C. Salmon.
  • [3] van Rooij, Robert. 2011c. Vagueness and linguistics. In: G. Ronzitti (ed.), The vagueness handbook, Springer New York, 2011.
  • [4] Lakoff
  • [5] Haverkamp (ed.), Metaphorologie
  • [6] Duden. Das Herkunftswörterbuch. Die Etymologie der Deutschen Sprache. Mannheim 1963.
  • [7] Oxford Encyclopedia of Semiotics: Semiotic Terminology. available online, last accessed 29.12.2011.
  • [8] Amos Tversky (1977), Features of Similarity. Psychological Review, Vol.84, No.4. available online
  • [9] Goldstone. Comparison. Springer, New York  2010.
  • [10] Dekang Lin, An Information-Theoretic Definition of Similarity. In: Proceedings of the 15th International Conference on Machine Learning ICML, 1998, pp. 296-304. download
  • [11] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996). From Data Mining to Knowledge Discovery in Databases. American Association for Artificial Intelligence, p.37-54.
  • [12] Thomas Reinartz, Focusing Solutions for Data Mining: Analytical Studies and Experimental Results in Real-World Domains (LNCS) Springer, Berlin 1999.
  • [13]  G. Nakaeizadeh (ed.). Data mining. Physica Weinheim, 1998.
  • [14] Alan Hájek (2007), The Reference Class Problem is Your Problem Too. Synthese 156: 185-215. available online.
  • [15] Edward K. Cheng (2009), A Practical Solution to the Reference Class Problem. COLUMBIA LAW REVIEW Vol.109:2081-2105. download
  • [16] Peirce, Stanford Encyclopedia.
  • [17] Tim Schroeder (2007), A Recipe for Concept Similarity. Mind & Language, Vol.22 No.1. pp. 68–91.
  • [18] Churchland
  • [19] Christoph Gauker (2007), A Critique of the Similarity Space Theory of Concepts. Mind & Language, Vol.22 No.4, pp.317–345.

۞

Vagueness: The Structure of Non-Existence.

December 29, 2011 § Leave a comment

For many centuries now, clarity has been the major goal of philosophy.

updated version featuring new references

It drove the first instantiation of logics by Aristotle, who devised it as a cure for mysticism, which was considered a kind of primary chaos in human thinking. Clarity has been the intended goal of the second enlightenment as a cure for scholastic worries, and among many other places we find it in Wittgenstein’s first work, now directed at philosophy itself. In any of those instances, logics served as the main pillar in pursuing the goal of clarity.

Vagueness seems to act as an opponent to this intention, lurking behind the scenes in any comparison, which is why one may regard it as ubiquitous in cognition. There are whole philosophical and linguistic schools dealing with vagueness as their favorite subject. Heather Burnett (UCLA) recently provided a rather comprehensive overview [1] of the various approaches, including her own proposals to solve some puzzles of vagueness in language, particularly related to relative and absolute adjectives and their context-dependency. In the domain of scientific linguistics, vagueness is characterized by three related properties: being fuzzy, being borderline, or being susceptible to the sorites (heap) paradox. A lot of rather different proposals for a solution have been suggested so far [1,2], most of them technically quite demanding; yet, none has been generally accepted as convincing.

The mere fact that there are many incommensurable theories, models and attitudes about vagueness we take as a clear indication for a still unrecognized framing problem. Actually, in the end we will see that the problem of vagueness in language does not “exist” at all. We will provide a sound solution that does not refer just to the methodological level. If we replace vagueness by the more appropriate term of indeterminacy we readily recognize that we can’t speak about vague and indeterminate things without implicitly talking about logics. In other words, the issue of (non-linguistic) vagueness triggers the question about the relation between logics and world. This topic we will investigate elsewhere.

Regarding vagueness, let us consider just two examples. The first one is Peter Unger’s famous example regarding clouds [3]. Where does a cloud end? This question can’t be answered. Close inspection and accurate measurement do not help. It seems as if the vagueness were a property of the “phenomenon” that we call “cloud.” If we conceive it as a particular type of object, we may ascribe to it a resemblance to what is called an “open set” in mathematical topology, or the integral over asymptotic functions. Bertrand Russell, however, would have called this the fallacy of verbalism [4, p.85].

Vagueness and precision alike are characteristics which can only belong to a representation, of which language is an example. […] Apart from representation, whether cognitive or mechanical, there can be no such thing as vagueness or precision;

For Russell, objects can’t be ascribed properties such as vagueness. Vagueness is a property of the representation, not of the object. Thus, when Unger concludes that there are no ordinary things, he gets trapped by several misunderstandings, as we will see. We could add that open sets, i.e. sets without a definable border, are not vague at all.

As the second example we take an abundant habit in linguistics when addressing the problem of vagueness, e.g. supervaluationism. This system has the consequence that borderline cases of vague terms yield statements that are neither true nor false. Although that model induces a truth-value gap, it nevertheless keeps the idea of truth values fully intact. All linguistic models about vagueness assume that it is appropriate to apply the idea of truth values, predicates and predicate logics to language.

As far as I can tell from all the sources I have inspected, any approach in linguistics about vagueness takes place within two very strong assumptions. The first basic assumption is that (1) the concept of “predicates” can be applied to an analysis of language. From that basic assumption, three more secondary ones derive: (1.1) Language is a means to transfer clear statements. (1.2) It is possible to use language in a way that no vagueness appears. (1.3) Words are items that can be used to build predicates.

Besides this first assumption of the “predicativity” of language, linguistics further assumes that words could be definite and non-ambiguous. Yet, that is not a basic assumption itself; it derives from the second basic assumption, which is that (2) the purpose of language is to transfer meaning unambiguously. All three aspects of that assumption are questionable: that language has a purpose, that it serves as a tool or even a medium to transfer meaning, and that it does so unambiguously.

So we summarize: Linguistics employs two strong assumptions:

  • (1) The concept of apriori determinable “predicates” can be applied to an analysis of language.
  • (2) The purpose of language is to transfer meaning unambiguously.

Our position is that both assumptions are deeply inappropriate. The second one we already dealt with elsewhere, so we focus on the first one here. We will see that the “problematics of vagueness” is non-existent. We do not claim that there is no vagueness, but we refute the claim that it is a problem. There are also no serious threats from linguistic paradoxes, because these paradoxes are simply a consequence of “silly” behavior.

We will provide several examples for that, but the structure is the following. The problematics consists of a performative contradiction of the rules one has set up before. One should not pretend to play a particular game by fixing the rules upon one’s own interests, only to violate those rules a moment later. Of course, one could create a play / game from this, too. Lewis Carroll wrote two books about the bizarre consequences of such a setting. Let us again listen to Russell’s arguments, now to his objection against the “paradoxicity” of “baldness,” which is usually subsumed under the sorites (heap) paradox.

It is supposed that at first he was not bald, that he lost his hairs one by one, and that in the end he was bald; therefore, it is argued, there must have been one hair the loss of which converted him into a bald man. This, of course, is absurd. Baldness is a vague conception; some men are certainly bald, some are certainly not bald, while between them there are men of whom it is not true to say they must either be bald or not bald. The law of excluded middle is true when precise symbols are employed, but it is not true when symbols are vague, as, in fact, all symbols are.

Now, describing the heap (Greek: sorites) or the hair of “balding” men by referring to countable parts of the whole, i.e. either sand particles or singularized hairs, contradicts the conception of baldness. Confronting both in a direct manner (removing hair by hair) mixes two different games. Mixing soccer and tennis is “silly,” especially after the participants have declared that they intend to play soccer; mixing vagueness and counting is silly, too, for the same reason.

This should make clear why the application of the concept of “predicates” to vague concepts, i.e. concepts that are apriori defined as being vague, is simply absurd. Remember, even a highly innovative philosopher such as Russell, co-author of an extremely abstract work like the Principia Mathematica, needed several years to accept Wittgenstein’s analysis that the usage of symbols in the Principia is self-contradictory, because actualized symbols are never free of semantics.

Words are Non-Analytic Entities

I would first recall an observation that was first, or at least most popularly, expressed by Augustine. His concern was the notion of time. I’ll give a sketch of it in my words. As long as he simply uses the word, he perfectly knows what time is. Yet, as soon as he starts to think about time, trying to get an analytic grip on it, he increasingly loses touch and understanding, until he does not know anything about it at all.

This phenomenon is not limited to the analysis of a concept like time, which some conceive even as a transcendental “something.” The phenomenon of disappearance upon close inspection is not unknown. We meet it in Carroll’s character of the Cheshire Cat, and we meet it in quantum physics. Let us call this phenomenon the CQ-phenomenon.

Ultimately, the CQ-phenomenon is a consequence of the self-referentiality of language and of the self-referentiality of the investigation of language. It is not possible to apply a scale to itself without getting into serious trouble like fundamental paradoxicity. The language game of “scale” implies a separation of observer and observed that can’t be maintained in the cases of the cat, the quantum, or language. Of course, there are ways to avoid such difficulties, but only at high costs. For instance, strong regulations or very strict conventions can be imposed on the investigation of such areas and the application of self-referential scales, to which one may count linguistics, sociology, the cognitive sciences, and of course quantum physics. Actually, positivism is nothing else than such a “strong convention.” Yet, even with such strong conventions being applied, the results of such investigations are surprising and arbitrary, far from being a consequence of rationalist research, because self-referential systems are always immanently creative.

It is more than salient that linguists create models about vagueness that are subsumed under language. This position is deeply nonsensical; it does not only purport ontological relevance for language, it also implicitly claims a certain “immediacy” for the linkage between language and empirical aspects of the world.

Our position is strongly different from that: models are entities that are “completely” outside of language. Of course, the two are not separable from each other. We will deal elsewhere with this mutual dependency in more detail and within a more appropriate framing. Regardless of how modeling and language are related, they definitely cannot be related in the way linguistics implicitly assumes. It is impossible to use language to transfer meaning, because it is in principle not possible to transfer meaning at all. Of course, this raises the question of what, then, is being “transferred.”

This brings us to the next objection against the presumed predicativity of language, namely its role in social intercourse, from which the CQ-phenomenon can’t be completely separated.

Language: What can be Said

Many things and thoughts are not explicable. Many things can only be demonstrated, not expressed in any kind of language. Yet, despite these two severe constraints, we may use language not only to speak about such things, but also to create what can only be demonstrated.

Robert Brandom’s work [5] may well be regarded as a further leap forward in the understanding of language and its practice. He proposes the inferentialist position, with which our positioning of the model is completely compatible. According to Brandom, we always have to infer a lot of things from received words during a discourse; we even have to signal that we expect those things to be inferred. The only thing we can try in a language-based interaction is to increasingly confine the degrees of freedom of the possible models that are created in the interactees’ minds. Yet achieving a certain state of resonance, or the feeling that one understands the other, does NOT imply that the models are identical. All that can be said is that the resonating models in the two interacting minds allow a certain successful prediction of the further course of the interaction. Here we should be very clear about our understanding of the concept of model; you will find it in the chapters about the generalized model and the formal status of models (as a category).
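To make this pruning picture more tangible, here is a toy sketch in code. It is entirely our own construction, not Brandom’s apparatus: the candidate “models” are trivial next-word predictors, an utterance merely removes candidates that fail on it, and “resonance” is checked as overlapping predictions rather than as identity of the surviving models.

```python
# Toy sketch (our construction, not Brandom's apparatus): utterances do not transfer
# meaning; they prune each interactee's space of candidate models. "Resonance" is
# operationalized as overlapping predictions, not as identity of the surviving models.

import random

VOCAB = ["cloud", "sky", "rain", "sun", "fog"]

def make_candidates(seed, n=50):
    """A hypothetical model space: each model maps a word to an expected follow-up."""
    rng = random.Random(seed)
    return [{w: rng.choice(VOCAB) for w in VOCAB} for _ in range(n)]

def prune(candidates, prev_word, next_word):
    """Keep only the models compatible with the observed continuation."""
    kept = [m for m in candidates if m[prev_word] == next_word]
    return kept or candidates      # never let the model space collapse completely

def resonance(a, b, word="cloud"):
    """Do the two surviving model sets share at least one prediction for `word`?"""
    return bool({m[word] for m in a} & {m[word] for m in b})

alice, bob = make_candidates(1), make_candidates(2)
for prev, nxt in [("cloud", "rain"), ("rain", "sky")]:   # a short "discourse"
    alice, bob = prune(alice, prev, nxt), prune(bob, prev, nxt)

print(len(alice), len(bob), resonance(alice, bob))
```

The two sets of surviving models are almost certainly different, yet both predict the short discourse equally well; nothing identical has been “transferred.”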

Since Austin [6] it has been well known that language is not equal to the series of graphical or phonic signals, the simple reason being that language is a social activity, both structurally and performatively. An illocutionary act is part of any utterance and of any piece of text in a natural language, sometimes even in the case of a formal language. Yet it is impossible to speak about that dimension in language.

A text is even more than a “series” of Austinian or Searlean speech acts. The reason for this is a certain aspect of embodiment: only entities equipped with memory can use language. Now, receiving a series of words immediately establishes a more or less volatile immaterial network in the “mind” of the receiving entity as well as in the “sending” entity. This network has properties about which it is absolutely impossible to speak, despite the fact that such networks somehow represent the ultimate purpose, or “essence,” of natural language. We can’t speak about it, we can’t explicate it, and we simply commit a categorical mistake if we apply logics, and tools from logics such as predicates, in the attempt to understand it.

Logics and Language

These phenomena clearly prove that logics and language are different things. They are deeply incommensurable, despite the fact that they can’t be completely separated from each other, much like modeling and language. The structure of the world shows up in the structure of logics, as Wittgenstein noted, and there are good reasons to take Wittgenstein seriously on that. According to the Tractatus, the coupling between world and logics can’t be a direct one [7].

In contrast to the world, logics is not productive; “novelty” is not a logical entity. Pure logics is a transcendental system prior to any usage of symbols, precisely because any usage would already require interpretation. Logical predicates are not something that needs to be interpreted. These are simply different games.

In his talk to the Jowett Society, Oxford, in 1923, Bertrand Russell, exhibiting an attitude quite different from that of the Principia and following closely the line drawn by Wittgenstein, writes [4, p.88]:

Words such as “or” and “not” might seem, at first sight, to have a perfectly precise meaning: “p or q” is true when p is true, true when q is true, and false when both are false. But the trouble is that this involves the notions of “true” and “false”; and it will be found, I think, that all the concepts of logic involve these notions, directly or indirectly. Now “true” and “false” can only have a precise meaning when the symbols employed—words, perceptions, images, or what not—are themselves precise. We have seen that, in practice, this is not the case. It follows that every proposition that can be framed in practice has a certain degree of vagueness; that is to say, there is not one definite fact necessary and sufficient for its truth, but a certain region of possible facts, any one of which would make it true. And this region is itself ill-defined: we cannot assign to it a definite boundary.

This is exactly what we meant before: “Precision” concerning logical propositions is not achievable as soon as we refer to symbols that we use. Only symbols that can’t be used are precise. There is only one sort of such symbols: transcendental symbols.

Mapping logics to language, as happens so frequently, and probably even as an acknowledged practice, in the linguistic treatment of vagueness, means reducing language to logics. One changes the frame of reference, much like Zeno does in his self-generated pseudo-problems, much like Cantor1 [8] and his fellow Banach2 [9] did (in contrast to Dedekind3 [10]), or as Taylor4 did [11]. Three-dimensionality produces paradoxes in a 2-dimensional world, not only faulty projections. It is not really surprising that the positivistic reduction of language to logics makes awkward paradoxes appear. Positivism implies violence, not only in the case of linguistics.

We now can understand why it is almost silly to apply a truth-value methodology to the analysis of language. The problem of vagueness is not a problem at all; vagueness belongs to the very blueprint of “language” itself. It is almost trivial to make remarks such as Russell’s [4, p.87]:

The fact is that all words are attributable without doubt over a certain area, but become questionable within a penumbra, outside which they are again certainly not attributable.

And it really should be superfluous to cite this 90-year-old piece. Quite remarkably, it is not.

Language as a Practice

Wittgenstein emphasized repeatedly that language is a practice. Language is not a structure, so it is neither equivalent to logics nor to grammar, let alone grammatology. In practices we need models for prediction or diagnosis, we need rules, and we frequently apply habits, which may even become symbolized.

Thus we may again ask what happens when we talk to each other. First, we exclude those models which we now understand to be inappropriate:

  • Logics is incommensurable with language.
  • Language, as well as any of its constituents, can’t be made “precise.”

As a consequence, language (and all of its constituents) is something that can’t be completely explicated; large parts of language can only be demonstrated. Of course, we do not deny the proposal that a discourse reflects “propositional content,” as Brandom calls it ([5], chp. 8.6.2). This propositional or conceptual content is given by the various kinds of models appearing in a discourse, models that are being built, inferred, refined, symbolized and finally externalized. As soon as we externalize a model, however, it is not language any more. We will investigate the dynamical route between concepts, logics and models in another chapter. Here and for the time being we may state that applying logics as a tool to language mistakes propositional content for propositional structure.

Again: what happens if I point to the white area up in the air against the blue background that we call sky, and then exclaim “Oh, look, a cloud!”? Do I mean that there is an object called “cloud”? An object at all? No, definitely not. Claiming that there are “cloud-constituters,” that we do not measure exactly enough, that there is no proper thing we could call a “cloud” (Unger), that our language has a defect, etc.: none of the purported “solutions” of the problem [for an overview see 11] helps to the slightest extent.

Anybody who has been on a mountain hike knows the fog of high altitudes. From lower regions, however, the very same phenomenon is seen as a cloud. This provides us with a hint that the language game “cloud” also comprises information about the physical, relational properties (position, speed, altitude) of the speaker.

What happens through this utterance is that I invite my partner in discourse to interpret a particular, presumably shared sensory input, and to interpret me and my interpretations as well. We may infer that the language game “cloud” contains a marker, linked to both the structure and the semantics of the word, indicating that (1) there is an “object” without sharp borders and (2) no precise measurement should be performed. The symbolic value of “cloud” is such that there is no room for a different interpretation. It is not the “object” that is indicated by the word “cloud,” but a particular procedure, or class of procedures, that I as the primary speaker suggest when saying “Oh, there is a cloud.” By means of such procedures a particular style of modeling is “induced” in my discourse partner, a particular way of actualizing an operationalization, leading to a representation of the signals from the external world that allows both partners to increase their mutual “understanding.” Yet even this “understanding” is not directed at the proposed object. This scheme transparently describes the inner structure of what Charles S. Peirce called a “sign situation.” Neither accuracy nor precision nor vagueness is a relevant dimension in such mutually induced “activities,” which we may call a Peircean “sign.” They are completely secondary, a symptom of the use and of the openness.

Russell correctly proposes that all words in a language are vague. Yet we would like to extend his proposal by drawing on the image of thought that we explicate throughout our writings here. Elsewhere we have already cited the Lagrangian trick of abstraction. Lagrange became aware of the power of a particular replacement operation: in a proposal or proposition, constants can always be replaced by appropriate procedures plus further constants. This increases the generality and abstractness of the representation. Our proposal extending Russell’s insight is aligned to this scheme:

Different words are characterised (among other factors) by different procedures to select a particular class (mode) of interpretation.

Such procedures are given precisely as a kind of model, needed in addition to those models implied in performing the interpretation of the actual phenomenon. The mode of interpretation comprises the selection of the scale employed in the operationalization, viz. the measurement. Coarser scales imply a more profound underdetermination, a larger variety of possible and acceptable models, and a stronger feeling of vagueness.
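The link between coarser scales and underdetermination can be illustrated with a small numeric experiment. Everything in it is our own toy set-up: threshold classifiers stand in for interpretation models, quantization stands in for the chosen scale, the invented heights and the usage of “tall” play the role of the phenomenon, and the number of thresholds tied for the best agreement measures how underdetermined the choice of model is.

```python
# Toy illustration: coarser measurement scales leave more models equally acceptable,
# i.e. they deepen underdetermination. "Models" are threshold classifiers for "tall";
# the heights and the usage of "tall" are invented for the example.

import numpy as np

rng = np.random.default_rng(0)
heights = rng.uniform(1.50, 2.00, size=200)       # observed persons, in metres
tall = heights > 1.80                             # assumed everyday usage of "tall"

def tied_best_models(scale, candidates=np.arange(1.55, 1.95, 0.01)):
    """Count the threshold models that are indistinguishable (tied for best agreement
    with usage) once the measurement has been quantized to the given scale."""
    q = np.round(heights / scale) * scale         # coarser scale -> fewer distinct values
    acc = np.array([np.mean((q > t) == tall) for t in candidates])
    return int(np.sum(acc >= acc.max() - 1e-9)), float(acc.max())

for scale in (0.001, 0.01, 0.05, 0.10):
    n, best = tied_best_models(scale)
    print(f"scale {scale:.3f} m: {n:3d} equally acceptable models, best agreement {best:.2f}")
```

The coarser the scale, the more thresholds end up indistinguishable on the data, and the lower the best achievable agreement: exactly the combination of underdetermination and felt vagueness described above.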

Note that all these models are outside of language. In our opinion it does not make much sense to instantiate the model inside of language and then claim a necessarily quite opaque “interpretation function,” as Burnett does extensively (if I have understood her correctly). Our proposal is also more general (and more abstract) than Burnett’s, since we emphasize the procedural selection of interpretation models (note that models are not functions!). The necessary models for words like “taller,” “balder” or “cloudy” are not part of language and can’t be defined in terms of linguistic concepts. I would not call that a “cognitivist” stance, though; we conceive of it simply as a consequence of the transcendental status of models. This proposal is linked to two further issues. First, it implies the acceptance of the necessity of models as a condition; in turn, we have to clarify our attitude towards the philosophical concept of the condition. Second, it implies the necessity of an instantiation, the actualization of it as the move from the transcendental to the applicable, which in turn invokes further transcendental concepts, as we will argue and describe here.

Having said this, we could add that models are not confined to “epistemological” affairs. As the relation between language (as a practice) and the “generalized” model shows, there is more in it than a kind of “generalized epistemology.” The generalization of epistemology can’t be conceived as a kind of epistemology at all, as we will argue in the chapter about the choreosteme. The particular relation between language and model as we have outlined it should also make clear that “models” are not limited to the categorization of observables in the outer world. It also applies, now in more classical terms, to the roots of what we can know without observation (e.g. Strawson, p.112 in [12]). It is not possible to act, to think, or to know without implying models, because it is not possible to act, to think or to know without transformation. This gives rise to the model as a category and to the question of the ultimate conditionability of language, actions, or knowing. In our opinion, and in contrast to Strawson’s distinction, it is not appropriate to separate “knowledge from observation” and “knowledge without observation.” Insisting on such a separation would also immediately drop the insight about the mutual dependency of models, concepts, symbols and signs, among many other things. In short, we would fall back directly into the mystic variant of idealism (cf. Frege’s hyper-platonism), implying also some “direct” link between language and idea. We rate such a disrespect of the body, of matter and of mediating associativity as inappropriate and of little value.

It would be quite interesting to conduct a comparative investigation of the conceptual life cycle of pictorial information in contrast to textual information along the line opened by such a “processual indicative.” Our guess is that the textual “word” may have a quite interesting visual counterpart. But we have to work on this later and elsewhere.

Our extension also leads to the conclusion that “vague” is not a logical “opposite” of “accurate,” nor of “precise.” Here we differ (not only) from Bertrand Russell’s position. So to speak, the vagueness of language applies here too. In our perspective, “accurate” simply symbolizes the indication to choose a particular class of models, a class that the speaker suggests the partner in discourse should use. Nothing more, but also nothing less. Models cannot be the “opposite” of other models. Words (or concepts) like “vague” or “accurate” just explicate the necessity of such a choice; most of the words in a language refer only implicitly to that choice. Adjectives, whether absolute or relative, can be either explicit or implicit about the choice of the procedure, depending on the context.

For us it is quite pleasing to discover a completely new property of words as they occur in natural languages. We call it the “processual indicative.” A “word” without such a processual indicative on the structural level would not be a “word” any more: either it reduces to a mere symbol, or even an index, or the context degenerates from a “natural” language (spoken and practiced in a community) into a formal language. The “processual indicative” of the language game “word” is a grammatical property (grammar here understood as philosophical grammar).

Nuisance, Flaws, and other Improprieties

Charles S. Peirce once mentioned, in a letter written around 1908, that is, well after his major works, and in answer to a question about the position or status of his own work, that he tended to label it idealistic materialism. Notably, Peirce founded what is known today as American pragmatism. The idealistic note, as well as the reference to materialism, has to be taken in an extremely abstract sense in order to be justified. Of course, Peirce himself was capable of handling such abstract levels.

Usually, however, idealism and pragmatism stand in strong contradiction to each other. This is especially true when it comes to engineering or, more generally, to the problematics of deviation, or the problematics posed by deviation, if you prefer.

Obviously, linguistics is blind, or even self-deceptive, with regard to its domain-specific “flaw”: vagueness. Linguists treat vagueness as a kind of flaw or nuisance, or at least as a kind of annoyance that needs to be overcome. As we already mentioned, there are many incommensurable proposals for how to overcome it, but none that first checks whether it is a flaw at all, and which conditions or assumptions lead to the proposal that vagueness is indeed a flaw.

Taking just one step back, it is quite obvious that logical positivism and its inheritance are the cause of the purported flaw. The problem “appeared” in the early 1960s, when positivism was prevailing. Dropping the assumptions of positivism also removes the annoyance of vagueness.

Engineering a new device is a demanding task. Yet there are two fundamentally different approaches. The first one, more idealistic in character, starts with an analytic representation, that is, a formula or, more likely, a system of formulas. Any influence that is not covered by that formula is shifted either into the premises or into the so-called noise: influences about which nothing “could” be known and which drive the system under modeling in an unpredictable direction. Since this approach starts with a formula, that is, an analytic representation, we can also say that it starts under the assumption of representability, or identity. In fact, whenever you hear designers, engineers or politicians speak about “disturbances,” it is more than obvious that they follow the idealistic approach, which in turn follows a philosophy of identity.

The second approach is very different from the first one, since it does not start with identity. Instead, it starts with the acknowledgement of difference. Pragmatic engineering does not work in spite of nuisances; it works precisely within and along them. Thus there is no such thing as a nuisance, a flaw, an annoyance, etc.; there is just fluctuation. Instead of assuming the structural constant labeled “ignorance,” as represented by the concept of noise, there is a procedure that is able to digest any fluctuation. A “disturbance” is not something that can be observed as such; quite in contrast, it is only a consequence of a particular selection of a purpose. Thus pragmatic engineering leads to completely different structures than would be generated under idealistic assumptions. The difference between the two remains largely invisible in all cases where the informational part is negligible (which actually is never the case), but it is vital to consider it in any context where formalization deals with information, whether in linguistics or in machine learning.
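A deliberately coarse sketch of the two attitudes, with the drifting signal and all numbers invented for illustration: the “idealistic” branch fixes a formula once and books every deviation as noise, while the “pragmatic” branch runs a procedure that keeps digesting the fluctuation instead of declaring it a disturbance.

```python
# Coarse sketch (all numbers invented): an "idealistic" fit declares residuals to be
# noise; a "pragmatic" procedure adapts online and digests the same fluctuation.

import numpy as np

rng = np.random.default_rng(1)
t = np.arange(500)
signal = np.sin(t / 30.0) + 0.002 * t            # a drifting process, unknown to both
obs = signal + rng.normal(0, 0.1, size=t.size)   # what is actually measured

# Idealistic: assume the "true" formula is a pure sine, calibrated once on early data.
amp = obs[:100].std() * np.sqrt(2)
ideal_pred = amp * np.sin(t / 30.0)
ideal_err = obs - ideal_pred                     # everything left over is booked as "noise"

# Pragmatic: no fixed formula; an exponentially weighted estimate follows the fluctuation.
alpha = 0.1
prag_pred = np.empty_like(obs)
level = obs[0]
for i, y in enumerate(obs):
    prag_pred[i] = level
    level = (1 - alpha) * level + alpha * y      # digest the deviation instead of discarding it

print("idealistic mean abs error:", round(float(np.abs(ideal_err).mean()), 3))
print("pragmatic  mean abs error:", round(float(np.abs(obs - prag_pred).mean()), 3))
```

The point is not the particular estimator but the attitude: the adaptive procedure has no slot for “noise” at all, only for deviations to be worked with.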

The issue relates to “cognition” too, understood here as the naively and phenomenologically observable precipitation of epistemic conditions. From everyday experience, but also as researchers in the “cognitive sciences,” we know, or at least could agree on the proposal, that cognition is something astonishingly stable. The traditional structuralist view, as Smith & Jones call it [13], takes this stability as its starting point and as the target of the theory. The natural consequence is that this theory rests on the a priori assumption of a strict identifiability of observable items and of the results of cognitive acts, which are usually called concepts and knowledge. In other words, the idea that knowledge is about identifiable items is nothing but a petitio principii: since it serves as the underlying assumption, it is no surprise that the result in the end exhibits the same quality. Yet there is a (not so) little problem, as Smith & Jones correctly identified (p. 184/185):

The structural approach pays less attention to variability (indeed, under a traditional approach, we design experiments to minimize variability) and not surprisingly, it does a poor job explaining the variability and context sensitivity of individual cognitive acts. This is a crucial flaw.  […]

Herein lies our discontent: If structures control what is constant about cognition, but if individual cognitive acts are smartly unique and adaptive to the context, structures cannot be the cause of the adaptiveness of individual cognitions. Why, then, are structures so theoretically important? If the intelligence – and the cause of real-time individual cognitive acts – is outside the constant structures, what is the value of postulating such structures?

The consequence the authors draw is to conceive of cognition as a process. They cite the work of Freeman [14] on the cognition of smelling:

They found that different inhalants did not map to any single neuron or even group of neurons but rather to the spatial pattern of the amplitude of waves across the entire olfactory bulb.

The heritage of naive phenomenology (phenomenology is always naive) and its main pillar, the “identifiability of X as X,” obviously leads to conclusions that are disastrous for the traditional theory. It vanishes.

Given these difficulties, positivists are trying to adapt. Yet people still dream of semantic disambiguation as a mechanical technique, or likewise dream (as Fregean worshippers) of eradicating vagueness from language by trying to explain it away.

One of the paradoxes dealt with over and over again is the already mentioned sorites paradox (from the Greek soros, “heap”). When is a heap a heap? Closely related to it are constructions like Wang’s Paradox [15]: if n is small, then n+1 is also small; hence there is no number that is not small. How should we deal with that?
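Spelled out formally, in our own rendering of the standard schema rather than Dummett’s notation, the construction is an ordinary mathematical induction applied to a vague predicate, with S(n) read as “n is small”:

```latex
\[
S(0), \qquad \forall n\,\bigl( S(n) \rightarrow S(n+1) \bigr) \;\vdash\; \forall n\, S(n)
\]
```

Both premises look acceptable for “small,” yet the conclusion, that every number is small, is plainly false. The conflict arises only because a counting game has been imposed on a vague predicate, exactly the mixing of games criticized above.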

Certainly, it does not help to invoke the famous “context dependency” as a potential cure. De Jaegher and van Rooij recently wrote [16]:

“If, as suggested by the Sorites paradox, fine-grainedness is important, then a vague language should not be used. Once vague language is used in an appropriate context, standard axioms of rational behaviour are no longer violated.”

Yet what could “appropriate” mean? Actually, for an endeavor such as the one De Jaegher and van Rooij have embarked upon, appropriateness would have to be determined by some means that is itself unaffected by vagueness. But how could that be done for items of language? They continue:

“The rationale for vagueness here is that vague predicates allow players to express their valuations, without necessarily uttering the context, so that the advantage of vague predicates is that they can be expressed across contexts.”

At first sight this seems plausible. But any part of language can be used in any context, so all of language is vague. The unfavorable consequence for De Jaegher & van Rooij is that their attempt is not even a self-disorganizing argument; it has the unique power of being self-vanishing, and their endeavor of expelling vagueness is doomed to fail before it has even started. Their main failure, however, is that they take for granted the a priori assumption that vagueness and crispness are “real” entities that somehow exist before any perception, such that language could be “infected” or affected by them. Note that this is not a statement about linguistics; it is one about philosophical grammar.

It also does not help to insist on “tolerance.” Van Rooij [17] recently mentioned that “vagueness is crucially related with tolerant interpretation.” Van Rooij desperately tries to hide his problem: the expression “tolerant interpretation” is almost completely empty. What should it mean to interpret something tolerantly as X? Not as X? Also a bit as Y? How then would we exchange ideas, and how could it be that we sometimes agree exactly on something? The problem is merely moved around a corner, not addressed in any reasonable manner. Yet there is a second objection to “tolerant interpretation.”

The interpretation of vague terms by a single entity must always fail. What is needed are TWO interpretations, played out as a negotiation in language games. Two entities, whether humans or machines, have to agree, i.e. they also have to be able to perform the act of agreeing, in order to resolve the vagueness of items in language. It is better to drop “vagueness” altogether and simply to say that at least two entities must necessarily be “present” to play a language game. This “presence” is, of course, an abstract, semiotic one; it is given in any Peircean sign situation. Since signs refer only and always to other signs, vagueness is, in other words, not a difficulty that needs to be “tolerated.”

Dummett [15] spent more than 20 pages on the examination of the problem of vagueness. To date it remains one of the most thorough examinations, but unfortunately it has not been received or recognized by contemporary linguistics: there is still a debate about it, but no further development of it. Dummett essentially proves that vagueness is not a defect of language; it is a “design feature.” First, he proposes a new logical operator, “definitely,” in order to deal with the particular quality of indeterminateness in language. Yet it does not remove vagueness or its problematic, “that is, the boundaries between which acceptable sharpenings of a statement or a predicate range are themselves indefinite” (p. 311).

He concludes that “vague predicates are indispensable”: they are not eliminable in principle without losing language itself. Tolerance does not help, just as selecting “appropriate contexts” fails to help; both are proposed merely to get rid of a problem. What linguists propose (at least those adhering to positivism, i.e. nowadays nearly all of them) is to “carry around a colour-chart, as Wittgenstein suggested in one of his examples” (Dummett). This would turn observational terms into legitimated ones by definition. Of course, the “problem” of vagueness would vanish, but along with it also any possibility to speak and to live. (Any apparent similarity to real persons, politicians or organizations such as the E.U. is indeed intended.)

Linguistics, and the cognitive sciences as well, will fail to provide any valuable contribution as long as they apply the basic condition of the positivist attitude: that subjects can be separated from each other in order to understand the whole. The whole here is the Lebensform working underneath, or beyond (Foucault’s field of proposals, Deleuze’s sediments), connected cognitions. It is almost ridiculous to try to explain anything regarding language under the assumption of identifiability and the applicability of logics.

Smith and Jones close their valuable contribution with the following statement, abandoning the naive realism-idealism that would be exhibited so eloquently by van Rooij and his co-workers nearly 20 years later:

On a second level, we questioned the theoretical framework – the founding assumptions – that underlie the attempt to define what “concepts really are.” We believe that the data on developing novel word interpretations – data showing the creative intelligence of dynamic cognition – seriously challenge the view of cognition as represented knowledge structures. These results suggest that perception always matters in a deep way: Perception always matters because cognition is always adaptive to the here-and-now, and perception is our only means of contact with the here-and-now reality.

There are a number of interesting corollaries here which we will not pursue. For instance, it would be a categorical mistake to talk about noise in complex systems. Another consequence is that any engineering, linguistics or philosophy based on the a priori concept of identity is unable to make reasonable proposals about evolving and developing systems, quite in contrast to a philosophy that starts with difference (as a transcendental category; see Deleuze’s work, particularly [18]).

We now can understand that idealistic engineering imposes its adjudications way too early. Consequently, idealistic engineering commits the naturalistic fallacy in the same way as much of linguistics commits it, at least insofar as the latter starts with the positivistic assumption of the possibility of positive assumptions such as identifiability. The conclusion for the engineering of machine-based episteme is quite obvious: we cannot start with identified or even identifiable items, and where it seems that we meet them, as in the case of words, we have to take their identifiability as a delusion or an illusion. We could also say that the only feasible input for a machine that is supposed to “learn” consists of vague items for which there is only a probabilistic description. Even more radically, we can see that without fundamentally embracing vagueness no learning is possible at all. That is the real reason for the failure of “strong” or “symbolic” AI.

Conclusions for Machine-based Epistemology

We started with a close inspection and a critique of the concept of vagueness and ended up with a contribution to the theory of language. Once again we see that language is not just about words, symbols and grammar. There is much more in it, and about it, that we must understand in order to bring language into contact with (brain) matter.

Our results clearly indicate, against the mainstream in linguistics and large parts of (mainly analytic) philosophy, that words can’t be conceived as parts of predicates, i.e. clear proposals, and that language can’t be used as a vehicle for the latter. This again justifies an initial probabilistic representation of those grouped graphemes (or phonemes) as they can be taken from a text, and which we call “words.” Of course, the transition from a probabilistic representation to the illusion of propositions is not a trivial one. Yet it is not words that we can see in a text, it is just graphemes. We will investigate the role and nature of words at some later point (“Waves, Words, and Images,” forthcoming).
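As a minimal sketch of what such an initial probabilistic representation could look like, and as nothing more than our own stand-in for the treatment deferred to that forthcoming chapter, grouped graphemes are described here solely by a probability distribution over the neighbourhoods in which they occur:

```python
# Minimal sketch: grouped graphemes ("words") are not taken as crisp identities but
# are described by a probability distribution over their observed neighbourhoods.
# The co-occurrence profile is our own stand-in for a "probabilistic description".

from collections import Counter, defaultdict

text = "the cloud hides the sun the fog hides the valley".split()

profiles = defaultdict(Counter)
for i, token in enumerate(text):
    for j in (i - 1, i + 1):                      # immediate neighbours as a minimal context
        if 0 <= j < len(text):
            profiles[token][text[j]] += 1

def distribution(token):
    """Normalize the counts into a probability distribution over contexts."""
    total = sum(profiles[token].values())
    return {ctx: n / total for ctx, n in profiles[token].items()}

print(distribution("hides"))   # "hides" is characterized only by the contexts it occurs in
print(distribution("cloud"))
```

Nothing in such a profile is an identified item; whatever later behaves like a proposition has to be derived from distributions of this kind.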

Secondly, we discovered a novel property, or constituent, of words: a selection function (or a class thereof) that indicates the style of interpretation with regard to the implied style of presumed measurement. We called it the processual indicative. Such a selection results in the invocation of either clear-cut relations and boundaries or indeterminable ones. Any implementation of the understanding of language necessarily has to implement such a property for all words. In all approaches known so far this function is absent, leading to serious paradoxes and disabilities.
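A minimal sketch of how such a constituent might be carried at the level of a data structure; the class names, the two model classes and the example words are purely illustrative assumptions, not a specification:

```python
# Illustrative sketch (names and model classes are assumptions): a "word" is carried
# not just as a symbol but together with a processual indicative, i.e. a selection
# function that picks the class of interpretation models to be applied.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ModelClass:
    name: str
    sharp_boundaries: bool          # does this class invoke clear-cut boundaries?
    scales: Sequence[float]         # admissible measurement scales (fine or coarse)

SHARP = ModelClass("count-and-threshold", True, scales=[0.001])
OPEN = ModelClass("gestalt-like, no precise measurement", False, scales=[0.1, 0.5])

@dataclass
class Word:
    symbol: str
    indicative: Callable[[str], ModelClass]   # the processual indicative

cloud = Word("cloud", indicative=lambda context: OPEN)
bald = Word("bald", indicative=lambda context: OPEN)     # counting hairs mixes games
prime = Word("prime", indicative=lambda context: SHARP)  # formal usage selects sharp models

for w in (cloud, bald, prime):
    mc = w.indicative("everyday discourse")
    print(w.symbol, "->", mc.name, "| sharp boundaries:", mc.sharp_boundaries)
```

The point of the sketch is only structural: the symbol alone never suffices; the selection function travels with it.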

A quite nice corollary of these results is that words can never be taken as references. It is perhaps more appropriate to conceive of words as symbols for procedural packages, recipes and prescriptions for how to arrange certain groups of models. Taken in this way, van Fraassen’s question of how words acquire reference is itself based on a drastic misunderstanding, deeply informed by positivism (remember that it was van Fraassen who invented this weird thing called supervaluationism). There is no such “reference.” Instead, we propose to conceive of words as units consisting of (visible) symbols and a “Lagrangean” differential part. This new conception of words remains completely compatible with Wittgenstein’s view of language as a communal practice; yet it avoids some difficulties Wittgenstein struggled with throughout his life. The core of these may be found in PI §201, describing the paradox of rule-following. For us, this paradox simply vanishes. Our model of words as symbolic carriers of “processual indicatives” also sheds light on what Charles S. Peirce called a “sign situation,” Peirce himself not being able to elucidate the structure of “signs” any further. Our inferentialist scheme lucidly describes the role of the symbolic as a quasi-material anchor, from which we can proceed via models, as the targets of the “processual indicative,” to meaning as a mutually ascribed resonance.

The introduction of the “processual indicative” also allows us to understand the phenomenon that, despite the vagueness of words and concepts, it is possible to achieve very precise descriptions. The precision, however, is just a “feeling,” as is the case for “vagueness,” dependent on the particular discursive situation. A larger body of “social” rules that can be invoked to satisfy the “processual indicative” allows for more precise statements. If, however, these rules are themselves indeterminate, more or less funny situations quite often occur (or disastrous misunderstandings as well).

The main conclusion, finally, refers to the social aspect of discourse. It is largely unknown how two “epistemic machines” will perceive, conceive of and act upon each other. Early experiments by Luc Steels involved mini-robots that were far too primitive to allow any valuable conclusion for our endeavor. And Stanislaw Lem’s short story “Personetics” [19] does not contain any hint about implementational issues… Thus, we first have to implement it…

Notes

1. One of Cantor’s paradoxes claims that a 2-dimensional space can be mapped entirely onto a 1-dimensional space without projection errors or overlaps. All of Cantor’s work is “absurd,” since it mixes two games that had a priori been separated: countability and uncountability. The dimension paradox appears because Cantor conceives of real numbers as determinable, hence countable entities. However, by his own diagonal argument, the real numbers are uncountably infinite. Real numbers are not determinable; hence they can’t be “re-ordered,” or put along a 1-dimensional line. It’s a “silly” contradiction. We conclude that such paradoxes are pseudo-paradoxes.

2. The Banach-Tarski (BT) pseudo-paradox has the same structure as Cantor’s dimensional pseudo-paradox. The surface of a sphere is broken apart into a finite number of “individual” pieces; yet those pieces are not of determinate shape. Banach and Tarski then prove that from the pieces of one sphere two spheres can be created. No surprise at all: the pieces are not of determinate shape; they are complicated, not usual solids but infinite scatterings of points. It is “silly” first to speak about pieces of a sphere and then to dissolve those pieces into Cantor dust. Countability and uncountability collide; thus there is no coherence, and the pieces can be anything. The BT paradox is even wrong as stated: from such premises an infinite number of balls could be created from a single ball, not just a second one.

3. Dedekind derives natural numbers as actualizations from their abstract uncountable differentials, the real numbers.

4. Taylor’s paradox brings scales into conflict. A switch is toggled repeatedly after ever shorter periods of time, such that each period is just half as long as the previous one. After n toggling events (n arbitrarily large), what is the state of the switch? Mathematically it is not defined (1 AND 0); statistically it is 1/2. Again, countability, which implies a physical act ultimately limited by the speed of light, is contrasted with infinitely small quantities, i.e. uncountability. Following Gödel’s incompleteness results, for any formal system it is possible to construct paradoxes by setting up “silly” games which do not obey the self-imposed a priori assumptions.
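The two readings mentioned in note 4 can be written out explicitly (our notation): the state after the n-th toggle alternates, so its limit is undefined, while its running average converges to 1/2.

```latex
\[
s_n = \frac{1-(-1)^n}{2} \in \{0,1\}, \qquad
\lim_{n\to\infty} s_n \ \text{does not exist}, \qquad
\frac{1}{N}\sum_{n=1}^{N} s_n \to \frac{1}{2} \quad (N \to \infty).
\]
```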

This article was created on Dec 29th, 2011, and republished in a considerably revised form on March 23rd, 2012.

References

  • [1] Heather Burnett, The Puzzle(s) of Absolute Adjectives – On Vagueness, Comparison, and the Origin of Scale Structure. Denis Paperno (ed). “UCLA Working Papers in Semantics,” 2011; version referred to is from 20.12.2011. available online.
  • [2] Brian Weatherson (2009), The Problem of the Many. Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. available online, last access 28.12.2011.
  • [3] Peter Unger (1980), The Problem of the Many.
  • [4] Bertrand Russell (1923): Vagueness, Australasian Journal of Psychology and Philosophy, 1(2), 84-92.
  • [5] Robert Brandom, Making it Explicit. 1994.
  • [6] John Austin. Speech act Theory.
  • [7] Colin Johnston (2009). Tractarian objects and logical categories. Synthese 167: 145-161.
  • [8] Cantor
  • [9] Banach
  • [10] Dedekind
  • [11] Taylor
  • [12] Peter Strawson, Individuals: An Essay in Descriptive Metaphysics. Methuen, London 1959.
  • [13] Linda B. Smith, Susan S. Jones (1993). Cognition Without Concepts. Cognitive Development, 8, 181-188. available here.
  • [14] Freeman, W.J. (1991). The physiology of perception. Scientific American. 264. 78-85.
  • [15] Michael Dummett, Wang’s Paradox (1975). Synthese 30 (1975) 301-324. available here.
  • [16] Kris De Jaegher, Robert van Rooij (2011). Strategic Vagueness, and appropriate contexts. Language, Games, and Evolution, Lecture Notes in Computer Science, 2011, Volume 6207/2011, 40-59, DOI: 10.1007/978-3-642-18006-4_3
  • [17] Robert van Rooij (2011). Vagueness, tolerance and non-transitive entailment in Understanding Vagueness – Logical, Philosophical and Linguistic Perspectives, Petr Cintula, Christian Fermuller, Lluis Godo, Petr Hajek (eds.), College Publications, 2011.
  • [18] Gilles Deleuze, Difference and Repetition.
  • [19] Stanislaw Lem, Personetics. Reprinted in: Douglas Hofstadter, Daniel Dennett (eds.), The Mind’s I.

۞
