Technical Aspects of Modeling
December 21, 2011
Modeling is not only inevitable in an empirical world, it is also a primary practice. Being in the world as an empirical being thus means continuous modeling. Modeling is not an event-like activity; it is much more like collecting the energy in a photovoltaic device. This does not apply only to living systems; it should also apply to economic organizations.
Modeling thus comprises much more than selecting some data from a source and applying a method or algorithm to them. You may conceive of that difference metaphorically as the difference between a machine and a plant for producing some goods. Here in this chapter we will first identify and briefly characterize the elements of continuous modeling; then we will show the overall arrangement of those elements as well as the structure of the modeling core process. We will then step further down to the level of properties that a software system for (continuous) modeling should comprise.
You will find much more details and a thorough discussion of the various design decisions for the respective software system in the attached document “The SPELA-Approach to Predictive Modeling.” This acronym stands for “Self-Configuring Profile-Based Evolutionary Learning Approach.” The document also describes how the result of modeling based on SPELA may be used properly for reasoning about the data.
Elements of Modeling
As we have shown in the chapter about the generalized model, a model needs a purpose. This targeted modeling and its internal organization is the subject of this chapter. Here we will not deal with the problem of modeling unstructured data such as texts or images. Understanding language tokens requires a considerable extension of the modeling framework, despite the fact that modeling as outlined here remains an important part of understanding language tokens. Those extensions mainly concern an appropriate probabilization of what we experience as words or sentences. We will discuss this elsewhere.
Such automated modeling can also be run as a continuous process. Its main elements are the following:
- (1) post-measurement phase: selecting and importing data;
- (2) extended classification by a core process group: building an intensional representation;
- (3) reflective post-processing (validation), meta-modeling, based on a (self-)monitoring repository;
- (4) harvesting results and/or aligning measurement.
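The four phases above can be pictured as a minimal control loop. The following is a hypothetical sketch only; every function name and the trivial "prototype" logic are illustrative stand-ins, not part of any software described in the text:

```python
# Hypothetical skeleton of the continuous modeling cycle.
# All names and the toy grouping logic are illustrative assumptions.

def select_and_import(source):
    """(1) Post-measurement phase: pull a batch of observations."""
    return list(source)

def build_intensional_representation(data):
    """(2) Core process group: reduce raw observations to prototype
    profiles. Trivial stand-in: group values by sign."""
    pos = [x for x in data if x >= 0]
    neg = [x for x in data if x < 0]
    return [group for group in (pos, neg) if group]

def validate(prototypes, repository):
    """(3) Reflective post-processing against a monitoring repository."""
    repository.append(len(prototypes))   # record what was produced
    return len(prototypes) > 0

def run_cycle(source, repository):
    """(4) Harvest results, or signal that measurement must be realigned."""
    data = select_and_import(source)
    prototypes = build_intensional_representation(data)
    if validate(prototypes, repository):
        return prototypes
    return None
```

In a continuous setting, `run_cycle` would be invoked repeatedly against a live data source, with the repository accumulating the self-monitoring trail that phase (3) reflects upon.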
The elements of (continuous, automated) modeling need to be arranged according to the following multi-layered, multi-loop organizational scheme:
Figure 1: Organizational elements for automated, continuous modeling. L<n> = loop levels; T = transformation of data; S = segmentation of data; R = detection of pairwise relationships between variables and their identification as mathematical functions F(d) = f(x,y), F(d) being a non-linear function improving the discriminant power represented by the typology derived in S; PP = post-processing, e.g. creating dependency path diagrams. The modules are connected through 4 levels of loops, L1 thru L4, where L1 = finding a semi-ordered list of optimized models; L2 = introducing the relationships found by R into the data transformation; L3 = additional sampling of raw data based on post-processing of the core process group (active sampling), e.g. for cross-validation purposes; and finally L4 = adapting the objective of the modeling process based on the results presented to the user. Feedback level L4 may be automated by selecting from pre-configured modeling policies, where a policy is a set of rules and parameters controlling the feedback levels L1 thru L3 as well as the core modules. DB = some kind of data source, e.g. a database.
This scheme may differ from anything you have seen so far about modeling. Common software packages, whether commercial (SPSS, SAS, S-Plus, etc.) or open source (R, Weka, Orange), do not natively support it. Some of them would allow for a similar scheme, but it is hard to accomplish. For instance, the transformation part is not properly separated and embedded in the overall process, and there is no possibility to screen for pairwise relationships that are then automatically actualized as further transformations of the "data." There is no meta-data and no abstraction inherent to the process. As a consequence, literally everything is left to the user, rendering those packages into gigantic formalisms. This comes, on the other hand, as little surprise, given the current paradigm of deterministic computing.

The main reason, however, for the incapability of any of these packages is the inappropriate theory behind them. Neither the paradigm of statistics nor that of "data mining" is applicable at all to the task of automated and continuous modeling.
Anyway, next we will describe the loops appearing in the scheme. The elements of the core process we will describe in detail later.
Here we should mention another process-oriented approach to predictive modeling, the CRISP-DM scheme, which was published as early as 1997 as the result of an initiative launched by a consortium including NCR, Daimler-Benz, and SPSS. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. However, CRISP-DM is of a hopelessly solipsistic character and only of little value.
Before we start we should note that the scheme above reflects an ideal and rather simple situation. More often than not, a nested, if not fractal structure appears, especially regarding loop levels L1 and L2.
Loop Level 1: Resonance
Here we find the associative structure, e.g. a self-organizing map. An important requirement is that this mechanism works bottom-up; a consequence of this is that it is an approximate mechanism.
The purpose of this element is to perform a segmentation of the available data, given a particular data space as defined by the "features," or, more appropriately, the assignates (see the chapter about the generalized model for this).

It is important to understand that it is impossible for the segmentation mechanism to change the structure of the available data space. Loop level L1 also provides what is called the transition from an extensional to an intensional description.
L1 also performs feature selection. Given a set of features FO, representing descriptional "dimensions" or aspects of the observations O, many of those features are not related to the intended target of the model. Hence they introduce noise and have to be removed, which amounts, in other words, to a selection of the remaining features.

In many applications there are large numbers of variables, especially if L2 is repeated, resulting in a vast number of possible selections. The number of possible combinations from the set of assignates easily exceeds 10^20, and sometimes even 10^100, a quantity larger than the number of subatomic particles in the visible universe. The only way to find a reasonable proposal for a "good" selection is by means of an evolutionary mechanism; formal, probabilistic approaches will fail.
The result of this step is a segmentation that can be represented as a table. The rows represent profiles of prototypes, while the columns show the selected features (for further details see below in the section about the core process).
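The evolutionary mechanism for feature selection can be illustrated by a toy genetic search over bit masks. This is only a sketch of the idea under simplifying assumptions (truncation selection, single-bit mutation, a synthetic fitness function); a real system would use far richer operators:

```python
# Toy evolutionary search over feature subsets, encoded as 0/1 masks.
# All names and parameters are illustrative assumptions.
import random

def evolve_feature_selection(n_features, fitness, pop_size=20,
                             generations=30, seed=0):
    """Evolve a feature mask; `fitness` scores a mask, higher is better."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]        # truncation selection (elitist)
        children = []
        for parent in survivors:
            child = parent[:]
            i = rng.randrange(n_features)      # single-bit mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Synthetic example: we pretend features 0 and 2 carry the signal.
target = [1, 0, 1, 0, 0, 0]
best = evolve_feature_selection(
    6, lambda mask: sum(1 for m, t in zip(mask, target) if m == t))
```

The point of the sketch is structural: the search never enumerates the 2^n subsets; it samples them under selection pressure, which is the only viable strategy once n grows large.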
Loop Level 2: Hypothetico-deductive Transformation
This step starts with a fixed segmentation based on a particular selection ℱ out of FO. The prototypes identified by L1 are the input data for a screening that employs analytic transformations of values within (mostly) pairwise selected variables, such as f(a,b) = a*b, or f(a,b) = 1/(a+b). Given a fixed set of analytic functions, a complete screening is performed for all possible combinations. Typically, several million individual checks are performed.
It is very important to understand that not the original data are used as input, but the data on the level of the intensional description, i.e. a first-order abstraction of the data.

Once the most promising transformations have been identified, they are automatically introduced into the set of original transformations in element T of figure 1.
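A complete pairwise screening can be sketched as follows. The two functions in the catalogue are the ones the text mentions; everything else (the profile layout, the scoring function) is an illustrative assumption:

```python
# Exhaustive screening of pairwise analytic transformations.
# Profile layout and scoring are illustrative assumptions.
from itertools import combinations

def screen_pairs(profiles, score):
    """`profiles` maps variable name -> list of prototype-level values.
    `score` rates a derived column (higher = more discriminative).
    Returns all candidates sorted best-first."""
    catalogue = {
        "a*b":     lambda a, b: a * b,
        "1/(a+b)": lambda a, b: 1.0 / (a + b) if a + b != 0 else float("nan"),
    }
    results = []
    for x, y in combinations(sorted(profiles), 2):
        for name, f in catalogue.items():
            column = [f(a, b) for a, b in zip(profiles[x], profiles[y])]
            results.append((score(column), f"{name} on ({x},{y})", column))
    return sorted(results, key=lambda r: r[0], reverse=True)

# Tiny example: 3 variables over 2 prototypes -> 3 pairs x 2 functions.
profiles = {"u": [1.0, 2.0], "v": [3.0, 5.0], "w": [0.5, 0.5]}
ranked = screen_pairs(profiles, score=lambda col: max(col) - min(col))
```

Note that the input rows are prototype profiles, not raw observations, matching the first-order abstraction emphasized above; with a few dozen variables and a handful of functions, the combinatorics already reaches the millions of checks mentioned in the text.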
Loop Level 3: Adaptive Sampling
See the legend of figure 1 above.
Loop Level 4: Re-Orientation
While the usage aspects are of course already reflected by the target variable and the selected risk structure, there is a further important aspect concerning the "usage" of models. Up to level 3 the whole modeling process can be run in an autonomous manner; not so on level 4.

Level 4 and its associated loop have been included in the modeling scheme as a dedicated means for re-orientation. The results of an L3 modeling raid could lead to "insights" that change the preference structure of the user. Upon this change in her/his preferences, the user could choose a different risk structure, or even a different target, perhaps also creating a further model with a complementary target.
These choices are obviously dependent on external influences such as organizational issues, or limitations / opportunities regarding the available resources.
Structure of the Modeling Core Process
Figure 2: Organizational elements of the modeling core process: (1) transformation of data, (2) goal-oriented segmentation, (3) artificial evolution of associations between variables, and (4) dependencies, i.e. the complete calculation of relations as analytic functions. The bottom row shows important keywords: P = putative property ("assignate"), F = arbitrary function, var = "raw" variable(s).
Transformation of Data
This step performs a purely formal, arithmetic and hence analytic transformation of values in a data table. Examples are:

- the log-transformation of a single variable, shifting the mode of the distribution to the right and thus allowing for a better discrimination of small values; one can also use it to create missing values in order to adaptively filter certain values, and thus observations;
- combinatorial synthesis of new variables from 2+ variables, resulting in a stretching, warping or folding of the parameter space;
- separating values from one variable into two new and mutually exclusive variables;
- binning, that is, reducing the scale of a variable, say from numeric to ordinal;
- any statistical measure or procedure changing the quality of an observation: the resulting values do not reflect observations, but instead represent a weight relative to the statistical measure.
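A few of the listed transformations can be sketched concretely. The function names and the particular bin edges are illustrative assumptions, not prescriptions from the text:

```python
# Sketches of three of the analytic transformations listed above.
# Names and parameters are illustrative assumptions.
import math

def log_transform(values):
    """Log-transform; non-positive values become missing (None), which
    can be used to adaptively filter values, and thus observations."""
    return [math.log(v) if v > 0 else None for v in values]

def synthesize(a, b):
    """Combinatorial synthesis of a new variable from two variables."""
    return [x * y for x, y in zip(a, b)]

def bin_numeric(values, edges):
    """Binning: reduce a numeric scale to an ordinal one (bin index =
    number of edges the value exceeds)."""
    return [sum(1 for e in edges if v > e) for v in values]

raw = [0.5, 1.0, 4.0, 10.0]
logged = log_transform(raw)                  # spreads the small values apart
binned = bin_numeric(raw, edges=[1.0, 5.0])  # ordinal scale 0..2
```

Each derived column sits alongside the raw one rather than replacing it, which is exactly the growth in the number of variables discussed next.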
A salient effect of the transformation of data is the increase in the number of variables. Also note that any of those analytic transformations destroys a little bit of the total information, although it also leads to a better discriminability of certain sub-spaces of the parameter space. Most important, however, is to understand that any analytic transformation is conceived as a hypothesis. Whether it is appropriate or not can be revealed ONLY by means of a targeted (goal-oriented) segmentation, which implies a cost function that in turn comprises the operationalization of risk (see the chapter about the generalized model).
Any of the resulting variables consists of assignates, i.e. the assigned attributes or features. Due to the transformation they comprise not just the "raw" or primary properties given upon the first contact of the observer with the observed, but also all of the transformations applied to such raw properties (aka variables). This results in an extended set of assignates.

We can now also see that transformations of measured data take on the same role as measurement devices. Initial differences in signals are received and selectively filtered according to the quasi-material properties of the device. The first step in figure 2 above thus also represents what could be called generalized measurement.
Transforming data by whatsoever algorithm or analytic method does NOT create a model. In other words, the model-aspect of statistical models does not reside in the statistical procedure, precisely because statistical models are not built upon associative mechanisms. The same is true for the widespread "physicalist" modeling, e.g. in the social sciences or in urbanism. In these areas, measured data are often represented by a "formula," i.e. a purely formal denotation, often in the form of a system of differential equations. Such systems are not by themselves a model, because they are analytic rewritings of the data. The model-aspect of such formulas gets instantiated only by associating parts of the measured data with a target variable as an operationalization of the purpose. Without a target variable, no purpose; without purpose, no model; without a model, no association, hence no prediction, no diagnostics, and no notion of risk whatsoever. Formal approaches always need further side-conditions and premises before they can be applied. Yet it is silly to come up with conditions for instantiations of "models" after the model has been built, since those conditions would inevitably lead to a different model. The modeling-aspect, again, is completely moved to the person(s) applying the model; hence such modeling is deeply subjective, implying serious and, above all, completely invisible risks regarding reproducibility and stability.
We conclude that the pretended modeling by formal methods has to be rated as bad practice.
Goal-oriented Segmentation

The segmentation of the data can be represented as a table. The rows represent profiles of prototypes, while the columns show the selected assignates (features); for further details see below in the section about the core process (will be added at a future date!).

In order to allow for a comparison of the profiles, the profiles have to be "homogeneous" with respect to their normalized variance. The standard SOM tends to collect "waste" or noise in some clusters, i.e. deeply dissimilar observations are collected in a single group because of their dissimilarity. Here we find one of the important modifications of the standard SOM as it is widely used. The effect of this modification is of vital size. For other design issues around the Self-organizing Map see the discussion elsewhere.
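One way to picture the problem the modification addresses is a homogeneity check over finished clusters. This is a toy stand-in, not the actual SOM modification (whose details the text does not give): it merely flags groups whose internal spread betrays them as "waste" collectors, so they can be re-segmented rather than reported as prototypes:

```python
# Toy homogeneity check for detecting "waste" clusters.
# A stand-in for the (unspecified) SOM modification, not its implementation.

def flag_noise_clusters(clusters, max_spread):
    """Return labels of clusters whose internal variance exceeds the
    homogeneity bound `max_spread`."""
    flagged = []
    for label, values in clusters.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        if var > max_spread:
            flagged.append(label)   # dissimilar members lumped together
    return flagged

# c1 is a tight, comparable profile; c2 only collects leftovers.
clusters = {"c1": [1.0, 1.1, 0.9], "c2": [0.0, 9.0, -8.0]}
noisy = flag_noise_clusters(clusters, max_spread=1.0)
```

A real implementation would of course work on normalized, multivariate profiles inside the map's training loop rather than as a post-hoc pass.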
Artificial Evolution

An evolutionary mechanism is necessary and even inevitable for screening the vast parameter space.

Dependencies

See the section about Loop Level 2 above.
In the practice of modeling one can find bad habits regarding any of the elements, loops and steps outlined above. Beginning with the preparation of data, there is the myth that missing values need to be "guessed" before an analysis can be done. What would be the justification for selecting a particular method to "infer" a value that is missing due to incomplete measurement? What do people expect to find in such data? Of course, filling gaps in data before creating a model from it is deep nonsense.
Another myth, still concerning the early phases of the modeling process, is the belief that analytical methods applied to measurement data "create" a model. They don't. They just destroy information. As soon as we align the working of the modeling mechanism to some target variable, the whole endeavor is not analytic any more. Yet without a target variable we would not create a model, just re-written measurement values that don't even measure "anything": measurement also needs a purpose. So it would be just silly first to pretend to do measurement and then to drop that intention by removing the target variable. All of statistics works like that. Whatever statistics is doing, it is not modeling. If someone uses statistics, that person uses just a rewriting tool; the modeling itself remains deeply opaque, based on personal preferences, in short: unscientific.
People recognize more and more that clustering is indispensable for modeling. Yet many people, particularly in the biological sciences (all the -omics), believe that there is a meaningful distinction between unsupervised and supervised clustering, and that both varieties produce models. That is deeply wrong. One cannot apply, say, K-means clustering or a SOM without a target variable, that is, a cost function, just for checking whether there is "something in the data." Any clustering algorithm applies some criteria to separate the observations. Why then should someone believe that precisely the more or less opaque, but surely purely formal, criteria of an arbitrary clustering algorithm would perfectly match the data at hand? Of course, nobody should believe that. Instead of surrendering oneself blindly to some arbitrary algorithmic properties, one should think of those criteria as free parameters that have to be tested according to the purpose of the modeling activity.
Another widespread misbehavior concerns what is called "feature selection." It is an abundant practice first to apply logistic regression to reduce the number of properties and then, in a completely separate second step, to apply some kind of "pattern matching" approach. Of course, the logistic regression acts as a kind of filter. But is this filter compatible with the second method? Is it appropriate to the data and the purpose at hand? You will never find out, because you have applied two different methods. It is thus impossible to play the ceteris paribus game. It appears comprehensible to proceed according to the split-method approach if you have just paper and pencil at your disposal. It is inexcusable to do so if computers are available.

In contrast to the split-method approach, one should use a single method that is able to perform feature selection AND data segmentation in the same process.
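The difference can be made concrete with a toy example in which the feature subset and the segmentation are scored inside one cost function, so the ceteris paribus condition holds. The split on the sign of the masked feature sum is an illustrative stand-in for a real segmentation step:

```python
# Toy joint feature selection + segmentation under ONE cost function.
# The sign-based "segmentation" is an illustrative stand-in.
from itertools import product

def segment(data, mask):
    """Two-group segmentation on the selected features only."""
    return [1 if sum(v for v, m in zip(row, mask) if m) > 0 else 0
            for row in data]

def joint_select_and_segment(data, target):
    """Score every feature mask by how well the segmentation it induces
    matches the target, so selection and segmentation are tested under
    identical conditions."""
    best_mask, best_hits = None, -1
    n = len(data[0])
    for mask in product([0, 1], repeat=n):
        if not any(mask):
            continue                       # empty selection is meaningless
        labels = segment(data, mask)
        hits = sum(l == t for l, t in zip(labels, target))
        if hits > best_hits:
            best_mask, best_hits = mask, hits
    return best_mask, best_hits

# Synthetic data: feature 0 carries the target, feature 1 is noise.
data = [[+1.0, -5.0], [+2.0, +4.0], [-1.0, +6.0], [-2.0, -3.0]]
target = [1, 1, 0, 0]
mask, hits = joint_select_and_segment(data, target)
```

Because a single cost function judges both decisions, any filter effect of the selection is automatically evaluated with respect to the segmentation it serves, which is exactly what the split-method approach cannot do.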
There are further problematic attitudes concerning the validation of models, especially concerning sampling and risk, which we won’t discuss here.
In this essay we provide the first general and complete scheme for target-oriented modeling. The main structural achievements comprise (1) the separation of analytic transformation, (2) associative sorting, (3) the evolutionary optimization of the selection of assignates, and (4) the constructive and combinatorial derivation of new assignates.

Note that any (computational) procedure of modeling fits into this scheme, even this scheme itself. Ultimately, any modeling results in a supervised mapping. In the chapters about the abstract formalization of models as categories we argue that models are level-2 categories.
It is precisely this separation that allows for an autonomous execution of modeling once the user has determined her target and the risk that appears acceptable. It depends completely on the context (whether external and organizational, or internal and more psychological) and on individual habits how these dimensions of purpose and safety are configured and handled.
From the perspective of our general interest in machine-based epistemology we can clearly see that target-oriented modeling by itself does not contribute much to that capability. Modeling, even if it creates new hypotheses, and even if we can reject the claim that modeling is an analytic activity, necessarily remains within the borders of the space determined by the purpose and the observations.
There is no particular difficulty in running even advanced modeling in an autonomous manner. Performing modeling is an almost material performance. Defining the target and selecting a risk attitude are of course not. Thus, in any predictive or diagnostic modeling the crucial point is to determine those. The risk attitude in particular implies unrooted beliefs and thus an embedding into social processes. Frequently, humans even change the target in order to obey certain limits concerning risk. Thus, in commercial projects the risk should be the only dimension one has to talk about when it comes to predictive/diagnostic modeling. Discussing methods or tools is nothing but silly.
It is pretty clear that approaching the capability for theory-building needs more than modeling, although target-oriented modeling is a necessary ingredient. We will see in further chapters how we can achieve that. The important step will be to drop the target from modeling. The result will be a pre-specific modeling, or associative storage, which serves as a substrate for any modeling that serves a particular target.
This article was first published 21/12/2011, last revision is from 5/2/2012