Prolegomena to a Morphology of Experience
May 2, 2012 § Leave a comment
Experience is a fundamental experience.
The very fact of this sentence demonstrates that experience differs from perception, much like phenomena are different from objects. It also demonstrates that there can’t be an analytic treatment or even solution of the question of experience. Experience is not only related to sensual impressions, but also to affects, activity, attention1 and associations. Above all, experience is deeply linked to the impossibility to know anything for sure or, likewise, apriori. This insight is etymologically woven into the word itself: in Greek, “peria” means “trial, attempt, experience”, influencing also the roots of “experiment” or “peril”.
In this essay we will focus on some technical aspects that are underlying the capability to experience. Before we go in medias res, I have to make clear the rationale for doing so, since, quite obviously so, experience could not be reduced to those said technical aspects, to which for instance modeling belongs. Experience is more than the techné of sorting things out [1] and even more than the techné of the genesis of discernability, but at the same time it plays a particular, if not foundational role in and for the epistemic process, its choreostemic embedding and their social practices.
Epistemic Modeling
As usual, we take the primacy of interpretation as one of transcendental conditions, that is, it is a condition we can‘t go beyond, even on the „purely“ material level. As a suitable operationalization of this principle, still a quite abstract one and hence calling for situative instantiation, we chose the abstract model. In the epistemic practice, the modeling does not, indeed, even never could refer to data that is supposed to „reflect“ an external reality. If we perform modeling as a pure technique, we are just modeling, but creating a model for whatsoever purpose, so to speak „modeling as such“, or purposed modeling, is not sufficient to establish an epistemic act, which would include the choice of the purpose and the choice of the risk attitude. Such a reduction is typical for functionalism, or positions that claim a principle computability of epistemic autonomy, as for instance the computational theory of mind does.
Quite in contrast, purposed modeling in epistemic individuals already presupposes the transition from probabilistic impressions to propositional, or say, at least symbolic representation. Without performing this transition from potential signals, that is mediated „raw“ physical fluctuations in the density of probabilities, to the symbolic it is impossible to create a structure, let it be for instance a feature vector as a set of variably assigned properties, „assignates“, as we called it previously. Such a minimal structure, however, is mandatory for purposed modeling. Any (re)presentation of observations to a modeling methods thus is already subsequent to prior interpretational steps.
Our abstract model that serves as an operationalization of the transcendental principle of the primacy of interpretation thus must also provide, or comprise, the transition from differences into protosymbols. Protosymbols are not just intensions or classes, they are so to speak nonempiric classes that have been derived from empiric ones by means of idealization. Protosymbols are developed into symbols by means of the combination of naming and an associated practice, i.e a repeating or reproducible performance, or still in other words, by rulefollowing. Only on the level of symbols we then may establish a logic, or claiming absolute identity. Here we also meet the reason for the fact that in any realworld context a “pure” logic is not possible, as there are always semantic parts serving as a foundation of its application. Speaking about “truthvalues” or “truthfunctions” is meaningless, at least. Clearly, identity as a logical form is a secondary quality and thus quite irrelevant for the booting of the capability of experience. Such extended modeling is, of course, not just a single instance, it is itself a multileveled thing. It even starts with the those properties of the material arrangement known as body that allow also an informational perspective. The most prominent candidate principle of such a structure is the probabilistic, associative network.
Epistemic modeling thus consists of at least two abstract layers: First, the associative storage of random contexts (see also the chapter “Context” for their generalization), where no purpose is implied onto the materially preprocessed signals, and second, the purposed modeling. I am deeply convinced that such a structure is only way to evade the fallacy of representationalism2. A working actualization of this abstract bilayer structure may comprise many layers and modules.
Yet, once one accepts the primacy of interpretation, and there is little to say against it, if anything at all, then we are lead directly to epistemic modeling as a mandatory constituent of any interpretive relationship to the world, for primitive operations as well as for the rather complex mental life we experience as humans, with regard to our relationships to the environment as well as with regard to our inner reality. Wittgenstein emphasized in his critical solipsism that the conception of reality as inner reality is the only reasonable one [3]. Epistemic modeling is the only way to keep meaningful contact with the external surrounds.
The Bridge
In its technical parts experience is based on an actualization of epistemic modeling. Later we will investigate the role and the usage of these technical parts in detail. Yet, the gap between modeling, even if conceived as an abstract, epistemic modeling, and experience is so large that we first have to shed some light on the bridge between these concepts. There are some other issues with experience than just the mere technical issues of modeling that are not less relevant for the technical issues, too.
Experience comprises both more active and more passive aspects, both with regard to performance and to structure. Both dichotomies must not be taken as ideally separated categories, of course. Else, the basic distinction into active and passive parts is not a new one either. Kant distinguished receptivity and spontaneity as two complementary faculties that combine in order to bring about what we call cognition. Yet, Leibniz, in contrast, emphasized the necessity of activity even in basic perception; nowadays, his view has been greatly confirmed by the research on sensing in organic (animals) as well as in inorganic systems (robots). Obviously, the relation between activity and passivity is not a simple one, as soon as we are going to leave the bright spheres of language.3
In the structural perspective, experience unfolds in a given space that we could call the space of experiencibility4. That space is spanned, shaped and structured by open and dynamic collections of any kind of theory, model, concept or symbol as well as by the mediality that is “embedding” those. Yet, experience also shapes this space itself. The situation reminds a bit to the relativistic space in physics, or the social space in humans, where the embedding of one space into another one will affect both participants, the embedded as well as the embedding space. These aspects we should keep in mind for our investigation of questions about the mechanisms that contribute to experience and the experience of experience. As you can see, we again refute any kind of ontological stances even to their smallest degrees.5
Now when going to ask about experience and its genesis, there are two characteristics of experience that enforce us to avoid the direct path. First, there is the deep linkage of experience to language. We must get rid of language for our investigation in order to avoid the experience of finding just language behind the language or what we call upfront “experience”; yet, we also should not forget about language either. Second, there is the selfreferentiality of the concept of experience, which actually renders it into a strongly singular term. Once there are even only tiny traces of the capability for experience, the whole game changes, burying the initial roots and mechanisms that are necessary for the booting of the capability.
Thus, our first move consists in a reduction and linearization, which we have to catch up with later again, of course. We will achieve that by setting everything into motion, sotospeak. The linearized question thus is heading towards the underlying mechanisms6:
How do we come to believe that there are facts in the world? 7
What are—now viewed from the outside of language8—the abstract conditions and the practiced moves necessary and sufficient for the actualization of such statements?
Usually, the answer will refer to some kind of modeling. Modeling provides the possibility for the transition from the extensional epistemic level of particulars to the intensional epistemic level of classes, functions or categories. Yet, modeling does not provide sufficient reason for experience. Sure, modeling is necessary for it, but it is more closely related to perception, though also not being equivalent to it. Experience as a kind of cognition thus can’t be conceived as kind of a “highlevel perception”, quite contrary to the suggestion of Douglas Hofstadter [4]. Instead, we may conceive experience, in a first step, as the result and the activity around the handling of the conditions of modeling.
Even in his earliest writings, Wittgenstein prominently emphasized that it is meaningless to conceive of the world as consisting from “objects”. The Tractatus starts with the proposition:
The world is everything that is the case.
Cases, in the Tractatus, are states of affairs that could be made explicit into a particular (logical) form by means of language. From this perspective one could derive the radical conclusion that without language there is no experience at all. Despite we won’t agree to such a thesis, language is a major factor contributing to some often unrecognized puzzles regarding experience. Let us very briefly return to the issue of language.
Language establishes its own space of experiencibility, basically through its unlimited expressibility that induces hermeneutic relationships. Probably mainly to this particular experiential sphere language is blurring or even blocking clear sight to the basic aspects of experience. Language can make us believe that there are phenomena as some kind of original stuff, existing “independently” out there, that is, outside the human cognition.9 Yet, there is no such thing like a phenomenon or even an object that would “be” before experience, and for us humans even not before or outside of language. It is even not reasonable to speak about phenomena or objects as if they would exist before experience. De facto, it is almost nonsensical to do so.
Both, objects as specified entities and phenomena at large are consequences of interpretation, in turn deeply shaped by cultural imprinting, and thus heavily depending on language. Refuting that consequence would mean to refute the primacy of interpretation, which would fall into one of the categories of either naive realism or mysticism. Phenomenology as an ontological philosophical discipline is nothing but a misunderstanding (as ontology is henceforth); since phenomenology without ontological parts must turn into some kind of Wittgensteinian philosophy of language, it simply vanishes. Indeed, when already being teaching in Cambridge, Wittgenstein once told a friend to report his position to the visiting Schlick, whom he refused to meet on this occasion, as “You could say of my work that it is phenomenology.” [5] Yet, what Wittgenstein called “phenomenology” is completely situated inside language and its practicing, and despite there might be a weak Kantian echo in his work, he never supported Husserl’s position of synthetic universals apriori. There is even some likelihood that Wittgenstein, strongly feeling to be constantly misunderstood by the members of the Vienna Circle, put this forward in order to annoy Schlick (a bit), at least to pay him back in kind.
Quite in contrast, in a Wittgensteinian perspective facts are sort of collectively compressed beliefs about relations. If everybody believes to a certain model of whatever reference and of almost arbitrary expectability, then there is a fact. This does not mean, however, that we get drowned by relativism. There are still the constraints implied by the (unmeasured and unmeasurable) utility of anticipation, both in its individual and its collective flavor. On the other hand, yes, this indeed means that the (social) future is not determined.
More accurately, there is at least one fact, since the primacy of interpretation generates at least the collectivity as a further fact. Since facts are taking place in language, they do not just “consist” of content (please excuse such awful wording), there is also a pragmatics, and hence there are also at least two different grammars, etc.etc.
How do we, then, individually construct concepts that we share as facts? Even if we would need the mediation by a collective, a large deal of the associative work takes place in our minds. Facts are identifiable, thus distinguishable and enumerable. Facts are almost digitized entities, they are constructed from percepts through a process of intensionalization or even idealization and they sit on the verge of the realm of symbols.
Facts are facts because they are considered as being valid, let it be among a collective of people, across some period of time, or a range of material conditions. This way they turn into kind of an apriori from the perspective of the individual, and there is only that perspective. Here we find the locus situs of several related misunderstandings, such as direct realism, Husserlean phenomenology, positivism, the thing as such, and so on. The fact is even synthetic, either by means of “individual”10 mental processes or by the working of a “collective reasoning”. But, of course, it is by no means universal, as Kant concluded on the basis of Newtonian science, or even as Schlick did in 1930 [6]. There is neither a universal real fact, nor a particular one. It does not make sense to conceive the world as existing from independent objects.
As a consequence, when speaking about facts we usually studiously avoid the fact of risk. Participants in the “fact game” implicitly agree on the abandonment of negotiating affairs of risk. Despite the fact that empiric knowledge never can be considered as being “safe” or “secured”, during the fact game we always behave as if. Doing so is the more or less hidden work of language, which removes the risk (associated with predictive modeling) and replaces it by metaphorical expressibility. Interestingly, here we also meet the source field of logic. It is obvious (see Waves & Words) that language is neither an extension of logics, nor is it reasonable to consider it as a vehicle for logic, i.e. for predicates. Quite to the contrast, the underlying hypothesis is that (practicing) language and (weaving) metaphors is the same thing.11 Such a language becomes a living language that (as Gier writes [5])
“[…] grows up as a natural extension of primitive behavior, and we can count on it most of the time, not for the univocal meanings that philosophers demand, but for ordinary certainty and communication.”
One might just modify Gier’s statement a bit by specifying „philosophers“ as idealistic, materialistic or analytic philosophers.
In “On Certainty” (OC, §359), Wittgenstein speaks of language as expressing primitive behavior and contends that ordinary certainty is “something animal”. This now we may take as a bridge that provides the possibility to extend our asking about concepts and facts towards the investigation of the role of models.
Related to this there is a pragmatist aspect that is worthwhile to be mentioned. Experience is a historicizing concept, much like knowledge. Both concepts are meaningful only in hindsight. As soon as we consider their application, we see that both of them refer only to one half of the story that is about the epistemic aspects of „life“. The other half of the epistemic story and directly implied by the inevitable need to anticipate is predictive or, equivalently, diagnostic modeling. Abstract modeling in turn implies theory, interpretation and orthoregulated rulefollowing.
Epistemology thus should not be limited to „knowledge“, the knowable and its conditions. Epistemology has explicitly to include the investigation of the conditions of what can be anticipated.
In a still different way we thus may repose the question about experience as the transition from epistemic abstract modeling to the conditions of that modeling. This would include the instantiation of practicable models as well as the conditions for that instantiation, and also the conditions of the application of models.In technical terms this transition is represented by a problematic field: The model selection problem, or in more pragmatic terms, the model (selection) risk.
These two issues, the prediction task and the condition of modeling now form the second toehold of our bridge between the general concept of experience and some technical aspects of the use of models. There is another bridge necessary to establish the possibility of experience, and this one connects the concept of experience with languagability.
The following list provides an overview about the following chapters:
 The Modeling Statement
 Predictability and Predictivity
 The Independence Assumption
 The Model Selection Problem
 Describing Classifiers
 Observations and Probabilities
 The Result of Modeling
These topics are closely related to each other, indeed so closely that other sequences would be justifiable too. Their interdependencies also demand a bit of patience from you, the reader, as the picture will be complete only when we arrive at the results of modeling.
A last remark may be allowed before we start to delve into these topics. It should be clear by now that any kind of phenomenology is deeply incompatible with the view developed here. There are several related stances, e.g. the various shades of ontology, including the objectivist conception of substance. They are all rendered as irrelevant and inappropriate for any theory about episteme, whether in its machinebased form or regarding human culture, whether as practice or as reflecting exercise.
The Modeling Statement
As the very first step we have to clearly state the goal of modeling. From the outside that goal is pretty clear. Given a set of observations and the respective outcomes, or targets, create a mapping function such that the observed data allow for a reconstruction of the outcome in an optimized manner. Finding such a function can be considered as a simple form of learning if the function is „invented“. In most cases it is not learning but just the estimation of predefined parameters.12 In a more general manner we also could say that any learning algorithm is a map L from data sets to a ranked list of hypothesis functions. Note that accuracy is only one of the possible aspects of that optimization. Let us call this for convenience the „outer goal“ of modeling. Would such mapping be perfect within reasonable boundaries, we would have found automatically a possible transition from probabilistic presentation to propositional representation. We could consider the induction of a structural description from observations as completed. So far the secret dream of Hans Reichenbach, Carl SchmidHempel, Wesley Salmon and many of their colleagues.
The said mapping function will never be perfect. The reasons for this comprise the complexity of the subject, noise in the measured data, unsuitable observables or any combinations of these. This induces a wealth of necessary steps and, of course, a lot of work. In other words, a considerable amount of apriori and heuristic choices have to be taken. Since a reliable, say analytic mapping can’t be found, every single step in the value chain towards the model at once becomes questionable and has to be checked for its suitability and reliability. It is also clear that the model does not comprise just a formula. In realworld situations a differential modeling should be performed, much like in medicine a diagnosis is considered to be complete only if a differential diagnosis is included. This comprises the investigation of the influence of the method’s parameterization onto the results. Let us call the whole bunch of respective goals the „inner goals“ of modeling.
So, being faced with the challenge of such empirical mess, how does the statement about the goals of the „inner modeling“ look like? We could for instance demand to remove the effects of the shortfalls mentioned above, which cause the imperfect mapping: complexity of the subject, noise in the measured data, or unsuitable observables.
To make this more concrete we could say, that the inner goals of modeling consist in a twofold (and thus synchronous!) segmentation of the data, resulting in the selection of the proper variables and in the selection of the proper records, where this segmentation is performed under conditions of a preceding nonlinear transformation of the embedding reference system. Ideally, the model identifies the data for which it is applicable. Only for those data then a classification is provided. It is pretty clear that this statement is an ambitious one. Yet, we regard it as crucial for any attempt to step across our epistemic bridge that brings us from particular data to the quality of experience. This transition includes something that is probably better known by the label „induction“. Thus, we finally arrive at a short statement about the inner goals of modeling:
How to conclude and what to conclude from measured data?
Obviously, if our data are noisy and if our data include irrelevant values any further conclusion will be unreliable. Yet, for any suitable segmentation of the data we need a model first. From this directly follows that a suitable procedure for modeling can’t consist just from a single algorithm, or a „oneshot procedure“. Any instance of singlestep approaches are suffering from lots of hidden assumptions that influence the results and its properties in unforeseeable ways. Modeling that could be regarded as more than just an estimation of parameters by running an algorithm is necessarily a circular and—dependent on the amount of variables—possibly openended process.
Predictability and Predictivity
Let us assume a set of observations S obtained from an empirical process P. Then this process P should be called “predictable” if the results of the mapping function f(m) that serves as an instance of a hypothesis h from the space of hypotheses H coincides with the outcome of the process P in such a way that f(m) forms an expectation with a deviation d<ε for all f(m). In this case we may say that f(m) predicts P. This deviation is also called “empirical risk”, and the purpose of modeling is often regarded as minimizing the empirical risk (ERM).
There are then two important questions. Firstly, can we trust f(m), since f(m) has been built on a limited number of observations? Secondly, how can we make f(m) more trustworthy, given the limitation regarding the data? Usually, these questions are handled under the label of validation. Yet, validation procedures are not the only possible means to get an answer here. It would be a misunderstanding to think that it is the building or construction of a model that is problematic.
The first question can be answered only by considering different models. For obtaining a set of different models we could apply different methods. That would be o.k. if prediction would be our sole interest. Yet, we also strive for detecting structural insights. And from that perspective we should not, of course, use different methods to get different models. The second possibility for addressing the first question is to use different subsamples, which turns simple validation into a crossvalidation. Crossvalidation provides an expectation for the error (or the risk). Yet, in order to compare across methods one actually should describe the expected decrease in “predictive power”13 for different sample sizes (independent crossvalidation per sample size). The third possibility for answering question (1) is related to the the former and consists by adding noised, surrogated (or simulated) data. This prevents the learning mechanism from responding to empirically consistent, but nevertheless irrelevant noisy fluctuations in the raw data set. The fourth possibility is to look for models of equivalent predictive power, which are, however, based on a different set of predicting variables. This possibility is not accessible for most statistical approaches such like Principal Component Analysis (PCA). Whatever method is used to create different models, models may be combined into a “bag” of models (called “bagging”), or, following an even more radical approach, into an ensemble of small and simple models. This is employed for instance in the socalled Random Forest method.
Commonly, if a model passes crossvalidation successfully, it is considered to be able to “generalize”. In contrast to the common practice, Poggio et al. [7] demonstrated that standard crossvalidation has to be extended in order to provide a characterization of the capability of a model to generalize. They propose to augment
CV_{1oo} stability with stability of the expected error and stability of the empirical error to define a new notion of stability, CVEEE_{1oo} stability.
This makes clear that Poggio’s et al. approach is addressing the learning machinery, not any longer just the space of hypotheses. Yet, they do not take the free parameters of the method into account. We conclude that their proposed approach still remains an uncritical approach. Thus I would consider such a model as not completely trustworthy. Of course, Poggio et al. are definitely pointing towards the right direction. We recognize a move away from naive realism and positivism, instead towards a critical methodology of the conditional. Maybe, philosophy and natural sciences find common grounds again by riding the information tiger.
Checking the stability of the learning procedure leads to a methodology that we called “data experiments” elsewhere. The data experiments do NOT explore the space of hypotheses, at least not directly. Instead they create a map for all possible models. In other words, instead of just asking about the predictability we now ask about the differential predictivity of in the space of models.
From the perspective of a learning theory Poggio’s move can’t be underestimated. Statistical learning theory (SLT)[8] explicitly assumes that a direct access to the world is possible (via identity function, perfectness of the model). Consequently, SLT focuses (only) on the reduction of the empirical risk. Any learning mechanism following the SLT is hence uncritical about its own limitation. SLT is interested in the predictability of the systemassuch, thereby not rather surprisingly committing the mistake of pre19th century idealism.
The Independence Assumption
The independence assumption [I.A.], or linearity assumption, acts mainly on three different targets. The first of them is the relationship between observer and observed, while its second target is the relationship between observables. The third target finally regards the relation between individual observations. This last aspect of the I.A. is the least problematic one. We will not discuss this any further.
Yet, the first and the second one are the problematic ones. The I.A. is deeply buried into the framework of statistics and from there it made its way into the field of explorative data analysis. There it can be frequently met for instance in the geometrical operationalization of similarity, the conceptualization of observables as Cartesian dimensions or independent coefficients in systems of linear equations, or as statistical kernels in algorithms like the Support Vector Machine.
Of course, the I.A. is just one possible stance towards the treatment of observables. Yet, taking it as an assumption we will not include any parameter into the model that reflects the dependency between observables. Hence, we will never detect the most suitable hypothesis about the dependency between observables. Instead of assuming the independence of variables throughout an analysis it would be methodological much more sound to address the degree of dependency as a target. Linearity should not be an assumption, it should be a result of an analysis.
The linearity or independence assumption carries another assumption with it under its hood: the assumption of the homogeneity of variables. Variables, or assignates, are conceived as blackboxes, with unknown influence onto the predictive power of the model. Yet, usually they exert very different effects on the predictive power of a model.
Basically, it is very simple. The predictive power of a model depends on the positive predictive value AND the negative predictive value, of course; we may also use closely related terms sensitivity and specificity. Accordingly, some variables contribute more to the positive predictive value, other help to increase the negative predictive value. This easily becomes visible if we perform a detailed typeI/II error analysis. Thus, there is NO way to avoid testing those combinations explicitly, even if we assume the initial independence of variables.
As we already mentioned above, the I.A. is just one possible stance towards the treatment of observables. Yet, its status as a methodological sine qua non that additionally is never reflected upon renders it into a metaphysical assumption. It is in fact an irrational assumption, which induces serious costs in terms of the structural richness of the results. Taken together, the independence assumption represents one of the most harmful habits in data analysis.
The Model Selection Problem
In the section “Predictability and Predictivity” above we already emphasized the importance of the switch from the space of hypotheses to the space of models. The model space unfolds as a condition of the available assignates, the size of the data set and the free parameters of the associative (“modeling”) method. The model space supports a fundamental change of the attitude towards a model. Based on the denial of the apriori assumption of independence of observables we identified the idea of a singular best model as an illposed phantasm. We thus move onwards from the concept of a model as a mapping function towards ensembles of structurally heterogeneous models that together as a distinguished population form a habitat, a manifold in the sphere of the model space. With such a structure we neither need to arrive at a single model.
Methods, Models, Variables
The model selection problem addresses two sets of parameters that are actually quite different from each other. Model selection should not be reduced to the treatment of the first set, of course, as it happens at least implicitly for instance in [9]. The first set refers to the variables as known from the data, sometimes also called the „predictors“. The selection of the suitable variables is the first half of the model selection problem. The second set comprises all free parameters of the method. From the methodological point of view, this second set is much more interesting than the first one. The method’s parameters are apriori conditions to the performance of the method, which additionally usually remain invisible in the results, in contrast to the selection of variables.
For associative methods like SOM or other clustering variables the effect of de/selecting variables can be easily described. Just take all the objects in front of you, for instance on the table, or in your room. Now select an arbitrary purpose and assign this purpose as a degree of support to those objects. For now, we have constructed the target. Now we go “into” the objects, that is, we describe them by a range of attributes that are present in most of the objects. Dependent on the selection of a subset from these attributes we will arrive at very different groups. The groups now represent the target more or less, that’s the quality of the model. Obviously, this quality differs across the various selections of attributes. It is also clear that it does not help to just use all attributes, because some of the attributes just destroy the intended order, they add noise to the model and decrease its quality.
As George observes [10], since its first formulation in the 1960ies a considerable, if not large number of proposals for dealing with the variable selection problem have been proposed. Although George himself seem to distinguish the two sets of parameters, throughout the discussion of the different approaches he always refers just to the first set, the variables as included in the data. This is not a failure of the said author, but a problem of the statistical approach. Usually, the parameters of statistical procedures are not accessible, as any analytic procedure, they work as they work. In contrast to Selforganizing Maps, and even to Artificial Neural Networks (ANN) or Genetic Procedures, analytic procedures can’t be modified in order to achieve a critical usage. In some way, with their monobloc design they perfectly fit into representationalist fallacy.
Thus, using statistical (or other analytic) procedures, the model selection problem consists of the variable selection problem and the method selection problem. The consequences are catastrophic: If statistical methods are used in the context of modeling, the whole statistical framework turns into a blackbox, because the selection of a particular method can’t be justified in any respect. In contrast to that quite unfavorable situation, methods like the SelfOrganizing Map provide access to any of its parameters. Data experiments are only possible with methods like SOM or ANN. Not the SOM or the ANN are „blackboxes“, but the statistical framework must be regarded as such. Precisely this is also the reason for the still ongoing quarrels about the foundations of the statistical framework. There are two parties, the frequentists and the bayesians. Yet, both are struck by the reference class problem [11]. From our perspective, the current dogma of empirical work in science need to be changed.
The conclusion is that statistical methods should not be used at all to describe realworld data, i.e. for the modeling of realworld processes. They are suitable only within a fully controlled setting, that is, within a data experiment. The first step in any kind of empirical analysis thus must consist of a predictive modeling that includes the model selection task.14
The Perils of Universalism
Many people dealing with the model selection task are mislead by a further irrational phantasm, caused by a mixture of idealism and positivism. This is the phantasm of the single best model for a given purpose.
Philosophers of science long ago recognized, starting with Hume and ultimately expressed by Quine, that empirical observations are underdetermined. The actual challenge posed by modeling is given by the fact of empirical underdetermination. Goodman felt obliged to construct a paradox from it. Yet, there is no paradox, there is only the phantasm of the single best model. This phantasm is a relic from the Newtonian period of science, where everybody thought the world is made by God as a miraculous machine, everything had to be welldefined, and persisting contradictions had to be rated as evil.
Secondarily, this moults into the affair of (semantic) indetermination. Plainly spoken, there are never enough data. Empirical underdetermination results in the actuality of strongly diverging models, which in turn gives rise to conflicting experiences. For a given set of data, in most cases it is possible to build very different models (ceteris paribus, choosing different sets of variables) that yield the same utility, or say predictive power, as far as this predictive power can be determined by the available data sample at all. Such ceteris paribus difference will not only give rise to quite different tracks of unfolding interpretation, it is also certainly in the close vicinity of Derrida’s deconstruction.
Empirical underdetermination thus results in a secondorder risk, the model selection risk. Actually, the model selection risk is the only relevant risk. We can’t change the available data, and data are always limited, sometimes just by their puniness, sometimes by the restrictions to deal with them. Risk is not attached to objects or phenomena, because objects “are not there” before interpretation and modeling. Risk is attached only to models. Risk is a particular state of affair, and indeed a rather fundamental one. Once a particular model would tell us that there is an uncertainty regarding the outcome, we could take measures to deal with that uncertainty. For instance, we hedge it, or organize some other kind of insurance for it. But hedging has to rely on the estimation of the uncertainty, which is dependent on the expected predictive power of the model, not just the accuracy of the model given the available data from a limited sample.
Different, but equivalent selections of variables can be used to create a group of models as „experts“ on a given task to decide on. Yet, the selection of such „experts“ is not determinable on the basis of the given data alone. Instead, further knowledge about the relation of the variables to further contexts or targets needs to be consulted.
Universalism is usually unjustifiable, and claiming it instead usually comes at huge costs, caused by undetectable blindnesses once we accept it. In contemporary empiricism, universalism—and the respecting blindness—is abundant also with regard to the role of the variables. What I am talking about here is context, mediality and individuality, which, from a more traditional formal perspective, is often approximated by conditionality. Yet, it more and more becomes clear that the Bayesian mechanisms are not sufficient to get the complexity of the concept of variables covered. Just to mention the current developments in the field of probability theory I would like to refer to Brian Weatherson, who favors and develops the socalled dynamic Keynesian models of uncertainty. [10] Yet, we regard this only as a transitional theory, despite the fact that it will have a strong impact on the way scientists will handle empiric data.
The mediating individuality of observables (as deliberately chosen assignates, of course) is easy to observe, once we drop the universalism qua independence of variables. Concerning variables, universalism manifests in an indistinguishability of the choices made to establish the assignates with regard to their effect onto the system of preferences. Some criteria C will induce the putative objects as distinguished ones only, if another assignate A has presorted it. Yet, it would be a simplification to consider the situation in the Bayesian way as P(CA). The problem with it is that we can’t say anything about the condition itself. Yet, we need to “play” (actually not “control”) with the conditionability, the inner structure of these conditions. As it is with the “relation,” which we already generalized into randolations, making it thereby measurable, we also have to go into the condition itself in order to defeat idealism even on the structural level. An appropriate perspective onto variables would hence treat it as a kind of media. This mediality is not externalizable, though, since observables themselves precipitate from the mediality, then as assignates.
What we can experience here is nothing else than the first advents of a real postmodernist world, an era where we emancipate from the compulsive apriori of independence (this does not deny, of course, its important role in the modernist era since Descartes).
Optimization
Optimizing a model means to select a combination of suitably valued parameters such that the preferences of the users in terms of risk and implied costs are served best. The model selection problem is thus the link between optimization problems, learning tasks and predictive modeling. There are indeed countless many procedures for optimization. Yet, the optimization task in the context of model selection is faced with a particular challenge: its mere size. George begins his article in the following way:
A distinguishing feature of variable selection problems is their enormous size. Even with moderate values of p, computing characteristics for all 2p models is prohibitively expensive and some reduction of the model space is needed.
Assume for instance a data set that comprises 50 variables. From that 1.13e15 models are possible, and assume further that we could test 10‘000 models per second, then we still would need more than 35‘000 years to check all models. Usually, however, building a classifier on a realworld problem takes more than 10 seconds, which would result in 3.5e9 years in the case of 50 variables. And there are many instances where one is faced with much more variables, typically 100+, and sometimes going even into the thousands. That’s what George means by „prohibitively“.
There are many proposals to deal with that challenge. All of them fall into three classes: they use either (1) some information theoretic measure (AIC, BIC, CIC etc. [11]), or (2) they use likelihood estimators, i.e. they conceive of parameters themselves as random variables, or (3) they are based of probabilistic measures established upon validation procedures. Particularly the instances from the first two of those classes are hit by the linearity and/or the independence assumption, and also by unjustified universalism. Of course, linearity should not be an assumption, it should be a result, as we argued above. Hence, there is no way to avoid the explicit calculation of models.
Given the vast number of combinations of symbols it appears straightforward to conceive of the model selection problem from an evolutionary perspective. Evolution always creates appropriate and suitable solutions from the available „evolutionary model space“. That space is of size 230‘000 in the case of humans, which is a „much“ larger number than the number of species ever existent on this planet. Not a single viable configuration could have been found by pure chance. Geneticsbased alignment and navigation through the model space is much more effective than chance. Hence, the socalled genetic algorithms might appear on the radar as the method of choice .
Genetics, revisited
Unfortunately, for the variable selection problem genetic algorithms15 are not suitable. The main reason for this is still the expensive calculation of single models. In order to set up the genetic procedure, one needs at least 500 instances to form the initial population. Any solution for the variable selection problem should arrive at a useful solution with less than 200 explicitly calculated models. The great advantage of genetic algorithms is their capability to deal with solution spaces that contain local extrema. They can handle even solution spaces that are inhomogeneously rugged, simply for the reason that recombination in the realm of the symbolic does not care about numerical gradients and criteria. Genetic procedures are based on combinations of symbolic encodings. The continuous switch between the symbolic (encoding) and the numerical (effect) are nothing else than the precursors of the separation between genotypes and phenotypes, without which there would not be even simple forms of biological life.
For that reason we developed a specialized instantiation of the evolutionary approach (implemented in SomFluid). Described very briefly we can say that we use evolutionary weights as efficient estimators of the maximum likelihood of parameters. The estimates are derived from explicitly calculated models that vary (mostly, but not necessarily ceteris paribus) with respect to the used variables. As such estimates, they influence the further course of the exploration of the model space in a probabilistic manner. From the perspective of the evolutionary process, these estimates represent the contribution of the respective parameter to the overall fitness of the model. They also form a kind of longterm memory within the process, something like a probabilistic genome. The shortterm memory in this evolutionary process is represented by the intensional profiles of the nodes in the SOM.
For the first initializing step, the evolutionary estimates can be estimated themselves by linear procedure like the PCA, or by nonparametric procedures (KruskalWallis, MannWhitney, etc.), and are available after only a few explicitly calculated models (model here means „ceteris paribus selection of variables“).
These evolutionary weights reflect the changes of the predictive power of the model when adding or removing variables to the model. If the quality of the model improves, the evolutionary weight increases a bit, and vice versa. In other words, not the apriori parameters of the model are considered, but just the effect of the parameters. The procedure is an approximating repetition: fix the parameters of the model (method specific, sampling, variables), calculate the model, record the change of the predictive power as compared to the previous model.
Upon the probabilistic genome of evolutionary weights there are many different ways one could take to implement the “evodevo” mechanisms, let it be the issue of how to handle the population (e.g. mixing genomes, aspects of virtual ecology, etc.), or the translational mechanisms, so to speak the “physiologies” that are used to proceed from the genome to an actual phenotype.
Since many different combinations are being calculated, the evolutionary weight represents the expectable contribution of a variable to the predictive power of the model, under whatsoever selection of variables that represents a model. Usually, a variable will not improve the quality of the model irrespective to the context. Yet, if a variable indeed would do so, we not only would say that its evolutionary weight equals 1, we also may conclude that this variable is a socalled confounder. Including a confounder into a model means that we use information about the target, which will not be available when applying the model for classification of new data; hence the model will fail disastrously. Usually, and that’s just a further benefit of dropping the independenceuniversalism assumption, it is not possible for a procedure to identify confounders by itself. It is also clear that the capability to do so is one of the cornerstones of autonomous learning, which includes the capability to set up the learning task.
Noise, and Noise
Optimization raises its own followup problems, of course. The most salient of these is socalled overfitting. This means that the model gets suitably fitted to the available observations by including a large number of parameters and variables, but it will return wrong predictions if it is going to be used on data that are even only slightly different from the observations used for learning and estimating the parameters of the model. The model represents noise, random variations without predictive value.
As we have been describing above, Poggio believes that his criterion of stability overcomes the defects with regard to the model as a generalization from observations. Poggio might be too optimistic, though, since his method still remains to be confined to the available observations.
In this situation, we apply a methodological trick. The trick consists in turning the problem into a target of investigation, which ultimately translates the problem into an appropriate rule. In this sense, we consider noise not as a problem, but as a tool.
Technically, we destroy the relevance of the differences between the observations by adding noise of a particular characteristic. If we add a small amount of normally distributed noise, nothing will probably change, but if we add a lot of noise, perhaps even of secondarily changing distribution, this will result in the mere impossibility to create a stable model at all. The scientific approach is to describe the dependency between those two unknowns, so to say, to set up a differential between noise (model for the unknown) and the model (of the unknown). The rest is straightforward: creating various data sets that have been changed by imposing different amounts of noise of a known structure, and plotting the predictive power against the the amount of noise. This technique can be combined by surrogating the actual observations via a Cholesky decomposition.
From all available models then those are preferred that combine a suitable predictive power with suitable degree of stability against noise.
Résumé
In this section we have dealt with the problematics of selecting a suitable subset from all available observables (neglecting for the time being that model selection involves the method’s parameters, too). Since mostly we have more observables at our disposal than we actually presume to need, the task could be simply described as simplification, aka Occam’s Razor. Yet, it would be terribly naive to first assume linearity and then selecting the “most parsimonious” model. It is even cruel to state [9, p.1]:
It is said that Einstein once said
Make things as simple as possible, but not simpler.
I hope that I succeeded in providing some valuable hints for accomplishing that task, which above all is not a quite simple one. (etc.etc. :)
Describing Classifiers
The gold standard for describing classifiers is believed to be the ReceiverOperatorCharacteristic, or short, ROC. Particularly, the area under the curve is compared across models (classifiers). The following Figure 1demonstrates the mechanics of the ROC plot.
Figure 1: Basic characteristics of the ROC curve (reproduced from Wikipedia)
Figure 2. Realistic ROC curves, though these are typical for approaches that are NOT based on subgroup structures or ensembles (for instance ANN or logistic regression). Note that models should not be selected on the basis of the AreaunderCurve. Instead the true positive rate (sensitivity) at a false positive rate FPR=0 should be used for that. As a further criterion that would indicate the stability of of the model one could use the slope of the curve at FPR=0.
Utilization of Information
There is still another harmful aspect of the universalistic stance in data analysis as compared to a pragmatic stance. This aspect considers the „reach“ of the models we are going to build.
Let us assume that we would accept a sensitivity of approx 80%, but we also expect a specificity of >99%. In other words, the cost for false positives (FP) are defined as very high, while the costs for false negatives (FN, not recognized preferred outcomes) are relatively low. The ratio of costs for error, or in short the error cost ratio err(FP)/err(FN) is high.
Table 1a: A Confusion matrix for a quite performant classifier.
Symbols: test=model; TP=true positives; FP=false positives; FN=false negatives; TN=true negatives; ppv=positive predictive value, npv=negative predictive value. FN is also called typeIerror (analogous to “rejecting the null hypothesis when it is true”), while FP is called typeIIerror (analogous to “accepting the null hypothesis when it is false”), and FP/(TP+FP) is called typeIIerrorrate, sometime labeled as βerror, where (1β) is the called the “power” of the test or model. (download XLS example)
condition Pos 
condition Neg 

test Pos 
100 (TP) 
3 (FP) 
0.971 
ppv 

test Neg 
28 (FN) 
1120 (TN) 
0.976 
npv 

0.781 
0.997 

sensitivity 
specificity 
Let us further assume that there are observations of our preferred outcome that we can‘t distinguish well from other cases of the opposite outcome that we try to avoid. They are too similar, and due to that similarity they form a separate group in our selforganizing map. Let us assume that the specificity of these clusters is at 86% only and the sensitivity is at 94%.
Table 1b: Confusion matrix describing a subgroup formed inside the SOM, for instance as it could be derived from the extension of a “node”.
condition Pos 
condition Neg 

test Pos 
0 (50) 
0 (39) 
0.0 (0.56) 
ppv 

test Neg 
50 (0) 
39 (0) 
0.44 (1.0) 
npv 

0.0 (1.0) 
1.0 (0.0) 

sensitivity 
specificity 
Yet, this cluster would not satisfy our risk attitude. If we would use the SOM as a model for classification of new observations, and the new observation would fall into that group (by means of similarity considerations) the implied risk would violate our attitude. Hence, we have to exclude such clusters. In the ROC this cluster represents a value further to the right on the specificity (X) axis.
Note that in the case of acceptance of the subgroup as a representative for a contributor of a positive prediction, the false negative is always 0 aposteriori, and in case of denial the true positives is always set to 0 (accordingly the figures for the condition negative).
There are now several important points to that, which are related to each other. Actually, we should be interested only in such subgroups with specificity close to 1, such that our risk attitude is well served. [13] Likewise, we should not try to optimize the quality of the model across the whole range of the ROC, but only for the subgroups with acceptable error cost ratio. In other words, we use the available information in a very specific manner.
As a consequence, we have to set the ECR before calculating the model. Setting the ECR after the selection of a model results in a waste of information, time and money. For this reason it is strongly indicated to use methods that are based on building a representation by subgroups. This again rules out statistical methods as they always take into account all available data. Zytkow calls such methods empirically empty [14].
The possibility to build models of a high specificity is a huge benefit of subgroup based methods like the SOM.16 To understand this better let us assume we have a SOMbased model with the following overall confusion matrix.
condition Pos 
condition Neg 

test Pos 
78 
1 
0.9873 
ppv 

test Neg 
145 
498 
0.7745 
npv 

0.350 
0.998 

sensitivity 
specificity 
That is, the model recognizes around 35% of all preferred outcomes. It does so on the basis of subgroups that all satisfy the respective ECR criterion. Thus we know that the implied risk of any classification is very low too. In other words, such models recognize whether it is allowed to apply them. If we apply them and get a positive answer, we also know that it is justified to apply them. Once the model identifies a preferred outcome, it does so without risk. This lets us miss opportunities, but we won’t be trapped by false expectations. Such models we could call autoconsistent.
In a practical project that has been aiming at an improvement of the postsurgery risk classification of patients (n>12’000) in a hospital we have been able to demonstrate that the achievable validated rate of implied risk can be as low as <10e4. [15] Such a low rate is not achievable by statistical methods, simply because there are far too few incidents of wrong classifications. The subjective cutoff points in logistic regression are not quite suitable for such tasks.
At the same time, and that’s probably even more important, we get a suitable segmentation of the observations. All observations that can be identified as positive do not suffer from any risk. Thus, we can investigate the structure of the data for these observations, e.g. as particular relationships between variables, such as correlations etc. But, hey, that job is already done by the selection of the appropriate set of variables! In other words, we not only have a good model, we also have found the best possibility for a multivariate reduction of noise, with a full consideration of the dependencies between variables. Such models can be conceived as reversed factorial experimental design.
The property of autoconsistency offers a further benefit as it is scalable, that is, “autoconsistent” is not a categorical, or symbolic, assignment. It can be easily measured as sensitivity under the condition of specificity > 1ε, ε→0. Thus, we may use it as a random measure (it can be described by its density) or as a scale of reference in case of any selection task among subpopulations of models. Additionally, if the exploration of the model space does not succeed in finding a model of a suitable degree of autoconsistency, we may conclude that the quality of the data is not sufficient. Data quality is a function of properly selected variables (predictors) and reproducible measurement. We know of no other approach that would be able to inform about the quality of the data without referring to extensive contextual “knowledge”. Needless to say that such knowledge is never available and encodable.
There are only weak conditions that need to be satisfied. For instance, the same selection of variables need to be used within a single model for all similarity considerations. This rules out all ensemble methods, as far as different selections of variables are used for each item in the ensemble; for instance decision tree methods (a SOM with its subgroups is already “ensemblelike”, yet, all subgroups are affected by the same selection of variables). It is further required to use a method that performs the transition from extensions to intensions on a subgroup level,which rules out analytic methods, and even Artificial Neural Networks (ANN). The way to establish autoconsistent models is not possible for ANN. Else, the errorcost ratio must be set before calculating the model, and the models have to be calculated explicitly, which removes linear methods from the list, such as Support Vector Machines with linear kernels (regression, ANN, Bayes). If we want to access the rich harvest of autoconsistent models we have to drop the independence hypothesis and we have to refute any kind of universalism. But these costs are rather low, indeed.
Observations and Probabilities
Here we developed a particular perspective onto the transition from observations to intensional representations. There are of course some interesting relationships of our point of view to the various possibilities of “interpreting” probability (see [16] for a comprehensive list of “interpretations” and interesting references). We also provide a new answer to Hume’s problem of induction.
Hume posed the question, how often should we observe a fact until we could consider it as lawful? This question, called the “problem of induction” points to the wrong direction and will trigger only irrelevant answer. Hume, living still in times of absolute monarchism, in a society deeply structured by religious beliefs, established a shortcut between the frequency of an observation and its propositional representation. The actual question, however, is how to achieve what we call an “observation”.
In very simple, almost artificial cases like the die there is nothing to interpret. The die and its values are already symbols. It is in some way inadequate to conceive of a die or of dicing as an empirical issue. In fact, we know before what could happen. The universe of the die consists of precisely 6 singular points.
Another extreme are socalled singlecase observations of structurally rich events, or processes. An event, or a setting should be called structurally rich, if there are (1) many different outcomes, and (2) many possible assignates to describe the event or the process. Such events or processes will not produce any outcome that is could be expected by symbolic or formal considerations. Obviously, it is not possible to assign a relative frequency to a unique, a singular, or a nonrepeatable event. Unfortunately, however, as Hájek points out [17], any actual sequence can be conceived of as a singular event.
The important point now is that singlecase observations are also not sufficiently describable as an empirical issue. Ascribing propensities to objectsintheworld demands for a wealth of modeling activities and classifications, which have to be completed apriori to the observation under scrutiny. Socalled singlecase propensities are not a problem of probabilistic theory, but one of the application of intensional classes and their usage as means for organizing one’s own expectations. As we said earlier, probability as it is used in probability theory is not a concept that could be applied meaningful to observations, where observations are conceived of as primitive “givens”. Probabilities are meaningful only in the closed world of available subjectively held concepts.
We thus have to distinguish between two areas of application for the concept of probability: the observational part, where we build up classes, and the anticipatory part, where we are interested in a match of expectations and actual outcomes. The problem obviously arises by mixing them through the notion of causality.17 Yet, there is absolutely no necessity between the two areas. The concept of risk probably allows for a resolution of the problems, since risk always implies a preceding choice of a cost function, which necessarily is subjective. Yet, the cost function and the risk implied by a classification model is also the angle point for any kind of negotiation, whether this takes place on an material, hence evolutionary scale, or within a societal context.
The interesting, if not salient point is that the subjectively available intensional descriptions and classes are dependent on ones risk attitude. We may observe the same thing only if we have acquired the same system of related classes and the same habits of using them. Only if we apply extreme risk aversion we will achieve a common understanding about facts (in the Wittgensteinian sense, see above). This then is called science, for instance. Yet, it still remains a misunderstanding to equate this common understanding with objects as objectsoutthere.
The problem of induction thus must be considered as a seriously illposed problem. It is a problem only for idealists (who then solve it in a weird way), or realists that are naive against the epistemological conditions of acting in the world. Our proposal for the transition from observations to descriptions is based on probabilism on both sides, yet, on either side there is a distinct flavor of probabilism.
Finally, a methodological remark shall be allowed, closely related to what we already described in the section about “noise” above. The perspective onto “making experience” that we have been proposing here demonstrates a significant twist.
Above we already mentioned Alan Hájek’s diagnosis that the frequentist and the Bayesian interpretation of probabilities suffer from the reference class problem. In this section we extended Hájek’s concerns to the concept of propensity. Yet, if the problem shows a high prevalence we should not conceive it as a hurdle but should try to treat it dynamically as a rule.The reference class is only a problem as long as (1) either the actual class is required as an external constant, or (2) the abstract concept of the class is treated as a fixed point. According to the rule of LagrangeDeleuze, any constant can be rewritten into a procedure (read: rules) and less problematic constants. Constants, or fixed points on a higher abstract level are less problematic, because the empirically grounded semantics vanishes.
Indeed, the problem of the reference class simply disappears if we put the concept of the class, together with all the related issues of modeling, as the embedding frame, the condition under which any notion of probability only can make sense at all. The classes itself are results of “rulefollowing”, which admittedly is blind, but whose parameters are also transparently accessible. In this way, probabilistic interpretation is always performed in a universe, that is closed and in principle fully mapped. We need the probabilistic methods just because that universe is of a huge size. In other words, the space of models is a Laplacean Universe.
Since statistical methods and similar interpretations of probability are analytical techniques, our proposal for a repositioning of statistics into such a Laplacean Universe is also well aligned with the general habit of Wittgenstein’s philosophy, which puts practiced logic (quasilogic) second to performance.
The disappearance of the reference class problem should be expected if our relations to the world are always mediated through the activity with abstract, epistemic modeling. The usage of probability theory as a “conceptual game” aiming for sharing diverging attitudes towards risks appears as nothing else than just a particular style of modeling, though admittedly one that offers a reasonable rate of success.
The Result of Modeling
It should be clear by now, that the result of modeling is much more than just a single predictive model. Regardless whether we take the scientific perspective or a philosophical vantage point, we need to include operationalizations of the conditions of the model, that reach beyond the standard empirical risk expressed as “false classification”. Appropriate modeling provides not only a set of models with wellestimated stability and of different structures; a further goal is to establish models that are autoconsistent.
If the modeling employs a method that exposes its parameters, we even can avoid the „method hell“, that is, the results are not only reliable, they are also valid.
It is clear that only autoconsistent models are useful for drawing conclusions and in building up experience. If variables are just weighted without actually being removed, as for instance in approaches like the Support Vector Machines, the resulting methods are not autoconsistent. Hence, there is no way towards a propositional description of the observed process.
Given the population of explicitly tested models it is also possible to describe the differential contribution of any variable to the predictive power of a model. The assumption of neutrality or symmetry of that contribution, as it is for instance applied in statistical learning, is a simplistic perspective onto the variables and the system represented by them.
Conclusion
In this essay we described some technical aspects of the capability to experience. These technical aspects link the possibility for experience to the primacy of interpretation that gets actualized as the techné of anticipatory, i.e. predictive or diagnostic modeling. This techné does not address the creation or derivation of a particular model by means of employing one or several methods. The process of building a model could be fully automated anyway. Quite differently, it focuses the parametrization, validation, evaluation and application of models, particularly with respect to the task of extract a rule from observational data. This extraction of rules must not be conceived as a “drawing of conclusions” guided by logic. It is a constructive activity.
The salient topics in this practice are the selection of models and the description of the classifiers. We emphasized that the goal of modeling should not be conceived as the task of finding a single best model.
Methods like the Selforganizing Map which are based on subgroup segmentation of the data can be used to create autoconsistent models, which represent also an optimally denoised subset of the measured data. This data sample could be conceived as if it would have been found by a factorial experimental design. Thus, autoconsistent models also provide quite valuable hints for the setup of the Taguchi method of quality assurance, which could be seen as a precipitation of organizational experience.
In the context of exploratory investigation of observational data one first has to determine the suitable observables (variables, predictors) and, by means of the same model(s), the suitable segment of observations before drawing domainspecific conclusions. Such conclusions are often expressed as contrasts in location or variation. In the context of designed experiments as e.g. in pharmaceutical research one first has to check the quality of the data, then to denoise the data by removing outliers by means of the same data segmentation technique, before again null hypotheses about expected contrasts could be tested.
As such, autoconsistent models provide a perfect basis for learning and for extending the “experience” of an epistemic individual. According to our proposals this experience does not suffer from the various problems of traditional Humean empirism (the induction problem), or contemporary (defective) theories of probabilism (mainly the problem of reference classes). Nevertheless, our approach remains fully empiricoepistemological.
1. As many other philosophers Lyotard emphasized the indisputability of an attention for the incidential, not as a perceptionas, but as an aisthesis, as a forming impression. see: Dieter Mersch, ›Geschieht es?‹ Ereignisdenken bei Derrida und Lyotard. available online, last accessed May 1st, 2012. Another recent source arguing into the same direction is John McDowell’s “Mind and World” (1996).
2. The label “representationalism” has been used by Dreyfus in his critique of symbolic AI, the thesis of the “computational mind” and any similar approach that assumes (1) that the meaning of symbols is given by their reference to objects, and (2) that this meaning is independent of actual thoughts, see also [2].
3. It would be inadequate to represent such a twofold “almost” dichotomy as a 2axis coordinate system, even if such a representation would be a metaphorical one only; rather, it should be conceived as a tetraedic space, given by two vectors passing nearby without intersecting each other. Additionally, the structure of that space must not expected to be flat, it looks much more like an inhomogeneous hyperbolic space.
4. “Experiencibility” here not understood as an individual capability to witness or receptivity, but as the abstract possibility to experience.
5. In the same way we reject Husserl’s phenomenology. Phenomena, much like the objects of positivism or the thingassuch of idealism, are not “out there”, they are result of our experiencibility. Of course, we do not deny that there is a materiality that is independent from our epistemic acts, but that does not explain or describe anything. In other words we propose go subjective (see also [3]).
6. Again, mechanism here should not be misunderstood as a single deterministic process as it could be represented by a (trivial) machine.
7. This question refers to the famous passage in the Tractatus, that “The world is everything that is the case.“ Cases, in the terminology of the Tractatus, are facts as the existence of states of affairs. We may say, there are certain relations. In the Tractatus, Wittgenstein excluded relations that could not be explicated by the use of symbols., expressed by the 7th proposition: „Whereof one cannot speak, thereof one must be silent.“
8. We must step outside of language in order to see the working of language.
9. We just have to repeat it again, since many people develop misunderstandings here. We do not deny the material aspects of the world.
10. “individual” is quite misleading here, since our brain and even our mind is not indivisable in the atomistic sense.
11. thus, it is also not reasonable to claim the existence of a somehow dualistic language, one part being without ambiguities and vagueness, the other one establishing ambiguity deliberately by means of metaphors. Lakoff & Johnson started from a similar idea, yet they developed it into a direction that is fundamentally incompatible with our views in many ways.
12. Of course, the borders are not well defined here.
13. “predictive power” could be operationalized in quite different ways, of course….
14. Correlational analysis is not a candidate to resolve this problem, since it can’t be used to segment the data or to identify groups in the data. Correlational analysis should be performed only subsequent to a segmentation of the data.
15. The socalled genetic algorithms are not algorithms in the narrow sense, since there is no welldefined stopping rule.
16. It is important to recognize that Artificial Neural Networks are NOT belonging to the family of subgroup based methods.
17. Here another circle closes: the concept of causality can’t be used in a meaningful way without considering its close amalgamation with the concept of information, as we argued here. For this reason, Judea Pearl’s approach towards causality [16] is seriously defective, because he completely neglects the epistemic issue of information.
 [1] Geoffrey C. Bowker, Susan Leigh Star. Sorting Things Out: Classification and Its Consequences. MIT Press, Boston 1999.
 [2] Willian Croft, Esther J. Wood, Construal operations in linguistics and artificial intelligence. in: Liliana Albertazzi (ed.) , Meaning and Cognition. Benjamins Publ, Amsterdam 2000.
 [3] Wilhelm Vossenkuhl. Solipsismus und Sprachkritik. Beiträge zu Wittgenstein. Parerga, Berlin 2009.
 [4] Douglas Hofstadter, Fluid Concepts And Creative Analogies: Computer Models Of The Fundamental Mechanisms Of Thought. Basic Books, New York 1996.
 [5] Nicholas F. Gier, Wittgenstein and Deconstruction, Review of Contemporary Philosophy 6 (2007); first publ. in Nov 1989. Online available.
 [6] Henk L. Mulder, B.F.B. van de VeldeSchlick (eds.), Moritz Schlick, Philosophical Papers, Volume II: (19251936), Series: Vienna Circle Collection, Vol. 11b, Springer, Berlin New York 1979. with Google Books
 [7] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee & Partha Niyogi (2004). General conditions for predictivity in learning theory. Nature 428, 419422.
 [8] Vladimir Vapnik, The Nature of Statistical Learning Theory (Information Science and Statistics). Springer 2000.
 [9] Herman J. Bierens (2006). Information Criteria and Model Selection. Lecture notes, mimeo, Pennsylvania State University. available online.
 [10 ]Brian Weatherson (2007). The Bayesian and the Dogmatist. Aristotelian Society Vol.107, Issue 1pt2, 169–185. draft available online
 [11] Edward I. George (2000). The Variable Selection Problem. J Am Stat Assoc, Vol. 95 (452), pp. 13041308. available online, as research paper.
 [12] Alan Hájek (2007). The Reference Class Problem is Your Problem Too. Synthese 156(3): 563585. draft available online.
 [13] Lori E. Dodd, Margaret S. Pepe (2003). Partial AUC Estimation and Regression. Biometrics 59( 3), 614–623.
 [14] Zytkov J. (1997). Knowledge=concepts: a harmful equation. 3^{rd} Conference on Knowledge Discovery in Databases, Proceedings of KDD97, p.104109.AAAI Press.
 [15] Thomas Kaufmann, Klaus Wassermann, Guido Schüpfer (2007). Beta error free risk identification based on SPELA, a neuroevolution method. presented at ESA 2007.
 [16] Alan Hájek, “Interpretations of Probability”, The Stanford Encyclopedia of Philosophy (Summer 2012 Edition), Edward N. Zalta (ed.), available online, or forthcoming.
 [17] Judea Pearl, Causality – Models, Reasoning, and Inference. 2nd ed. Cambridge University Press, Cambridge (Mass.) 2008 [2000].
۞
Leave a Reply