Dealing with a Large World

June 10, 2012

The world, as an imaginary totality of all actual and virtual relationships between assumed entities, can be described in innumerable ways. Even what we call a “characteristic” forms only in a co-dependent manner together with the formation processes of entities and relationships. This fact is particularly disturbing if we encounter something for the first time, without the guidance provided by more or less applicable models, traditions, beliefs or quasi-material constraints. Without those means, any selection out of all possible or constructible properties is doomed to be fully contingent, subject to pure randomness.

Yet, this does not mean that the outcomes are similarly random. Provided that the tools and methods for the task or situation at hand are given, modeling is for the most part the task of reducing the infinity of possible selections in such a way that the resulting representation can be expected to be helpful. Of course, this “utility” is not a hard measure in itself. It depends not only on the subjective attitude to risk, mainly the model risk and the prediction risk; utility is also relative to the scale of the scope. In other words, it depends on whether one is interested in motor or other purely physical aspects, in tactical or in strategic aspects, in more local or more global aspects, both in time and space, or in any kind of balanced mixture of those aspects. Establishing such a mixture is a modeling task in itself, of course, albeit one that is often accomplished only implicitly.

The randomness mentioned above is a direct corollary of empirical underdetermination.1 From a slightly different perspective, we may also say that it is an inevitable consequence of the primacy of interpretation. We should also not forget that language, particularly metaphors in language, and any kind of analogical thinking as well, are means to deal constructively with that randomness, turning physical randomness into contingency. Even within the penultimate guidance of predictivity (it is only a soft guidance, though), large parts of what we could reasonably conceive as facts (as temporarily fixed arrangements of relations) are mere collaborative construction, an ever undulating play between the individual and the general.
Even if analogical thinking is indeed the cornerstone, if not the Acropolis, of human mindedness, it is always preceded by and always rests upon modeling. Only a model allows us to pick some aspect out of the otherwise unsorted impressions taken up from the “world”. In previous chapters we already discussed quite extensively various general as well as some technical aspects of modeling, from an abstract as well as from a practical perspective.2 Here we focus on a particular challenge, the selection task regarding the basic descriptors used to set up a particular model.

Well, given a particular modeling task, we face the practical challenge of reducing a large set of pre-specific properties to a small set of “assignates” that together represent in some useful way the structure of the dynamics of the system we have observed. How can we reduce a set of properties created by observation that comprises several hundred of them?
This challenge arises even in the case of linear systems if we try to avoid subjective “cut-off” points that are buried deep inside the method we use. Such heuristic means are widespread in statistically based methods. The bad thing about them is that you can’t control their influence on the results. Since the task comprises the selection of properties for the description of the entities (prototypes) to be formed, such arbitrary thresholds, often justified or even enforced just by the method itself, will exert a profound influence on the semantic level. In other words, the method corroborates its own assumption of neutrality.

Yet, we should also never assume linearity of a system, because most of the interesting real systems are non-linear, even in the case of trivial machines. Brute-force approaches are not feasible, because the number of possible models (that is, of possible selections of variables) is 2^n, with n the number of properties or variables. Non-linear models can’t be extrapolated from known ones, of course. The Laplacean demon 3 became completely wrapped by Thomean folds 4, being even quite worried by things like Turing’s formal creativity 5.

When dealing with observations from “non-linear entities”, we are faced with the necessity to calculate and evaluate any selection of variables explicitly. Assuming a somewhat fantastic figure of 0.0000001 seconds (10^-7 s) needed to calculate a single model, we would still need about 4×10^15 years to visit all 2^100 ≈ 1.3×10^30 models if we had to deal with just 100 variables. To make it more palpable: it would take nearly 900,000 times longer than the age of the Earth, which is roughly 4.5 billion years…
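
The order of magnitude of such an exhaustive search can be checked with a few lines of arithmetic. The following is only a sketch under stated assumptions: 10^-7 seconds per model, a year of 365.25 days, and an age of the Earth of about 4.54 billion years. The function and constant names are ours.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year
AGE_OF_EARTH_YEARS = 4.54e9             # assumption: commonly cited estimate

def exhaustive_search_years(n_variables, seconds_per_model=1e-7):
    """Years needed to evaluate every subset of n_variables variables."""
    n_models = 2 ** n_variables          # each variable is either in or out
    return n_models * seconds_per_model / SECONDS_PER_YEAR

years = exhaustive_search_years(100)
print(f"{years:.2e} years")
print(f"{years / AGE_OF_EARTH_YEARS:.2e} times the age of the Earth")
```

Under these assumptions the search takes on the order of 10^15 years. Every additional variable doubles the figure, so faster hardware does not change the conclusion.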

Obviously, we have to drop the idea that we can “prove” the optimality of a particular model. The only thing we can do is to minimize the probability that a better model can be found within a given time T. On the other hand, the data are not of unbounded complexity, since real systems are not either. There are regularities, islands of stability, so to speak. There is always some structure, otherwise the system would not persist as an observable entity. As a consequence, we can organize the optimization of “failure time probability”; we may even consider this a second-order optimization. We may briefly note that the actual task is thus not only to select a proper set of variables; we should also identify the relations between the observed and constructed variables. Of course, there are always several, if not many, sets of variables that we could consider “proper”, precisely for the reason that they form a network of relations, even if this network is probabilistic in nature and is itself a kind of model.

So, how to organize this optimization? Basically, everything has to be organized as nested, recurrent processes. The overall game we could call learning. Yet, it should be clear that every “move” and every fixation of some parameter and its value is nothing else than a hypothesis. There is no “one-shot approach”, and no linear progression either.
If we want to avoid naive assumptions (and any assumption that remains untested is de facto a naive assumption), we have to test them. Everything is trial and error, or expressed in a more educated manner, everything has to be conceived as a hypothesis. Consequently, we can reduce the number of variables only by a recurrent mechanism. As a lemma, we conclude that any approach that reduces the number of variables in a non-recurrent fashion can’t be considered sound.

Contingent Collinearities

It is the structuredness of the observed entity that causes the similarity of any two observations across all available or a priori chosen properties. We may also expect that any two variables could be quite “similar”6 across all available observations. This provides the first two opportunities for reducing the size of the problem. Note that such reduction by “black-listing” applies only to the first steps in a recurrent process. Once we have evidence that certain variables do not contribute to the predictivity of our model, we may loosen the intensity of any of the reductions! Instead of removing a variable from the space of expressibility, we should preferably arrive at a weighted preference list in later stages of modeling.
So, if we find n observations or variables to be sufficiently collinear, we could remove a portion p(n) from this set, or we could compress them by averaging.
R1: reduction by removing or compressing collinear records.
R2: reduction by removing or compressing collinear variables.
A feasible criterion for assessing the collinearity is the monotonicity in the relationship between two variables as it is reflected by Spearman’s correlation. We also could apply K-means clustering using all variables, then averaging all observations that are “sufficiently close” to the center of the clusters.
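
A minimal sketch of R2 based on Spearman’s correlation could look as follows. It is an illustration, not a reference implementation: the greedy order, the threshold of 0.95 and all function names are our assumptions. The threshold is exactly the kind of preliminary cut-off that has to be revisited in later iterations.

```python
def ranks(values):
    """Average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def drop_collinear(columns, threshold=0.95):
    """Greedy pass: keep a variable only if |rho| stays below the
    threshold against every variable kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(spearman(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

columns = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 4, 9, 16, 25, 36],   # monotone in "a", hence rho = 1
    "c": [3, 1, 6, 2, 5, 4],
}
print(drop_collinear(columns))    # "b" is set aside as collinear with "a"
```

In a recurrent setting the variables that are set aside would be postponed for later checking rather than discarded for good.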
Although the respective thresholding is only a preliminary tactical move, we should be aware of the problems we introduce by such a reduction. Firstly, it is the size of the problem that brings in a notion of irreversibility, even if we are fully aware of the preliminary character of the move. Secondly, R1 is indeed critical because it is, in a quite obvious way, a petitio principii. Even tiny differences in some variables could be masked by larger differences in variables that are eventually recognized as irrelevant. Hence, very tight constraints should be applied when performing R1.
When removing collinear records we also have to take care of the outcome indicator. Often, the focused outcome is much less frequent than its “opposite”. Preferably, we should remove records that are marked as negative outcome, down to a ratio of 1:1 between positive and negative outcomes in the reduced data. Such “adaptive” sampling is similar to so-called “biased sampling”.
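
The balancing of outcomes can be sketched as follows. Note the simplification: negatives are dropped at random here, whereas the procedure described above would preferably drop those negatives that are collinear with other records. The record layout and the function name are hypothetical.

```python
import random

def balance_by_downsampling(records, is_positive, seed=0):
    """Keep all positive records and a random subset of the negative
    ones, so that the reduced data has a ratio of at most 1:1."""
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    rng = random.Random(seed)
    n_keep = min(len(negatives), len(positives))
    return positives + rng.sample(negatives, n_keep)

# Illustrative data: 10 positive and 90 negative records.
records = [{"id": i, "outcome": 1 if i % 10 == 0 else 0} for i in range(100)]
balanced = balance_by_downsampling(records, lambda r: r["outcome"] == 1)
print(len(balanced))
```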

Directed Collinearities

In addition to those two collinearities there is a third one, which is related to the purpose of the model. Variables that do not contribute to the predictive reconstruction of the outcome could be called “empirically empty”.

R3: reduction by removing empirically empty variables

Modeling without a purpose can’t be considered to be modeling at all 7, so we always have a target variable available that reflects the operationalization of the focused outcome. We could argue that only those variables are interesting for a detailed inspection that are collinear to the target variable.

Yet, that’s a problematic argument, since we need some kind of model to decide whether to exclude a variable or not, based on some collinearity measure. Essentially, that model claims to predict the predictivity of the final model, which of course is not possible. Any such a priori “determination” of the contribution of a variable to the final predictivity of a model is nothing else than a very preliminary guess. Thus, we should indeed treat it just as a guess, i.e. we should consider it as a propensity weight for selecting the variable. In the first explorative steps, however, we could choose an aggressive threshold, causing the removal of many variables from the vector.
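
One way to treat the collinearity with the target as a propensity weight rather than as a verdict is sketched below. The choice of |Spearman’s rho| as the measure, the lower bound `floor` (so that no variable is excluded for good) and all names are our assumptions.

```python
import random

def ranks(values):
    """Average ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) - 1) / 2 + 1 for v in values]

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def propensity_weights(columns, target, floor=0.05):
    """Normalized selection weights; `floor` keeps weak variables selectable."""
    raw = {name: max(abs(spearman(values, target)), floor)
           for name, values in columns.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def draw_variables(weights, k, rng):
    """Weighted sampling of k variables without replacement."""
    pool, chosen = dict(weights), []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names])[0]
        chosen.append(pick)
        del pool[pick]
    return chosen

target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],   # monotone with the target
    "x2": [5, 1, 8, 4, 2, 7, 3, 6],   # unrelated to the target
}
weights = propensity_weights(columns, target)
print(weights)
print(draw_variables(weights, 1, random.Random(0)))
```

The weights are meant to be revised after each modeling round, so a variable with a low initial weight can still enter later selections.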

Splitting

R1 removes redundancy across observations. The same effect can be achieved by a technique called “bagging”, or similarly “foresting”. In both cases a comparatively small portion of the observations is taken to build a “small” model, and the “bag” or “forest” of all small models is then taken to build the final, compound model. Bagging as a technique of “split & reduce” can also be applied in the variable domain.

R4: reduction of complexity by splitting
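
A toy version of “split & reduce” across observations is sketched below. The “small” model is a nearest-centroid classifier (one prototype per class), chosen only because it is short. The sample fraction, the number of small models and the unweighted majority vote are all assumptions made for illustration.

```python
import random
from collections import Counter

def fit_centroids(rows, labels):
    """Small model: one mean vector (prototype) per class."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(model, row):
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist2(model[label]))

def fit_bag(rows, labels, n_models=15, fraction=0.3, seed=0):
    """Build many small models, each from a small random portion of the data."""
    rng = random.Random(seed)
    size = max(2, int(fraction * len(rows)))
    bag = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in range(size)]
        bag.append(fit_centroids([rows[i] for i in idx], [labels[i] for i in idx]))
    return bag

def predict_bag(bag, row):
    """Compound model: majority vote over all small models."""
    votes = Counter(predict_centroid(model, row) for model in bag)
    return votes.most_common(1)[0][0]

# Illustrative data: two well-separated groups in two dimensions.
rng = random.Random(1)
rows = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0.0, 5.0) for _ in range(20)]
labels = [0] * 20 + [1] * 20
bag = fit_bag(rows, labels)
print(predict_bag(bag, [0.0, 0.0]), predict_bag(bag, [5.0, 5.0]))
```

Applying the same idea in the variable domain would mean drawing a random subset of columns, instead of rows, for each small model.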

Confirming

Once an acceptable model or set of models has been built, we can check the postponed variables one after another. In the case of splitting, the confirmation is implicitly performed by weighting the individual small models.

Compression and Redirection

Elsewhere we already discussed the necessity and the benefits of separating the transformation of data from the association of observations. If we separate them, we can see that all we need is an improvement or a preservation of the potential distinguishability of observations. The associative mechanism need not “see” anything that even comes close to the raw data, as long as the resulting association of observations results in a proper derivation of prototypes.8

This opens the possibility for a compression of the observations, e.g. by the technique of random projection. Random projection maps a vector space onto another one of lower dimensionality. If the dimensionality of the resulting vector of reduced size remains large enough (100+), then the separability of the vectors is largely kept intact. The reason is that in a high-dimensional vector space almost all vectors are “orthogonal” to each other. In other words, random projection approximately preserves the structure of the relations between vectors.

R5: reduction by compression
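
A minimal sketch of R5 is given below. It assumes a Gaussian random matrix scaled by 1/sqrt(d), which is one common construction and not necessarily the one used in our own implementation. The sizes (1000 input variables, d=100) and all names are illustrative.

```python
import math
import random

def random_projection_matrix(n_in, n_out, seed=0):
    """Gaussian entries scaled by 1/sqrt(n_out), so that squared
    lengths are preserved on average."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]

def project(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(42)
observations = [[rng.random() for _ in range(1000)] for _ in range(5)]

# The matrix is the parameter of the transformation; it would be stored
# and reused when the model is applied to new observations.
matrix = random_projection_matrix(1000, 100)
compressed = [project(matrix, obs) for obs in observations]

# Ratio of pairwise distances after and before compression.
ratios = [distance(compressed[i], compressed[j]) / distance(observations[i], observations[j])
          for i in range(5) for j in range(i + 1, 5)]
print(min(ratios), max(ratios))
```

The ratios scatter around 1; the scatter grows as the target dimensionality shrinks, which is the noise one accepts in exchange for the smaller problem.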

During the first explorative steps one could construct a vector space of d=50, which allows a rather efficient exploration without introducing too much noise. Noise in normalized vector space essentially means changing the “direction” of the vectors; the effect of changing the length of vectors due to random projection is much less profound. Also note that introducing noise is not a bad thing at all: it helps to avoid overfitting, resulting in more robust models.

If we conceive of this compression by means of random projection as a transformation, we could store the matrix of random numbers as parameters of that transformation. We could then apply it in any subsequent classification task, i.e. when we apply the model to new observations. Yet, the transformation by random projection destroys the semantic link between observed variables and the predictivity of the model. Any of the columns after such a compression contains information from more than one of the input variables. In order to support understanding, we have to reconstruct the semantic link.
Fortunately, that’s not a difficult task, although it is only possible if we use an index that allows us to identify the observations even after the transformation. The result of building the model is a collection of groups of records, or of indices, respectively. Based on these indices we simply identify those variables that minimize the ratio of the variance within the groups to the variance of the means per variable across the groups. This provides us with weights for the list of all variables, which can be used to drastically reduce the list of input variables for the final steps of modeling.
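
The restoration of the semantic link can be sketched as follows. The ratio is computed on the original, untransformed variables, using the group index assigned to each record by the model. The mapping from ratio to weight, 1/(1+ratio), is merely one possible choice and an assumption of ours, as are the function names.

```python
from statistics import mean, pvariance

def variance_ratios(rows, group_of_row):
    """Per variable: mean within-group variance divided by the
    variance of the group means. Small values mark variables that
    separate the groups well."""
    groups = {}
    for row, g in zip(rows, group_of_row):
        groups.setdefault(g, []).append(row)
    ratios = []
    for j in range(len(rows[0])):
        columns = [[row[j] for row in members] for members in groups.values()]
        within = mean(pvariance(col) for col in columns)
        between = pvariance([mean(col) for col in columns])
        ratios.append(within / between if between > 0 else float("inf"))
    return ratios

def weights_from_ratios(ratios):
    """Assumed mapping: low ratio -> high weight, normalized to sum 1."""
    raw = [1.0 / (1.0 + r) for r in ratios]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw

# Illustrative data: variable 0 separates the two groups, variable 1 does not.
rows = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0],
        [4.0, 4.8], [4.1, 1.1], [3.9, 3.2]]
group_of_row = [0, 0, 0, 1, 1, 1]   # indices produced by the (compressed) model

ratios = variance_ratios(rows, group_of_row)
weights = weights_from_ratios(ratios)
print(ratios)
print(weights)
```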

The whole approach could be described as sort of a redirection procedure. We first neglect the linkage between semantics of individual variables and prediction in order to reduce the size of the task, then after having determined the predictivity we restore the neglected link.
This opens the road for an even more radical redirection path. We already mentioned that all we need to preserve through transformation is the distinguishability of the observations without distorting the vectors too much. This could be accomplished not only by random projection, though. If we interpret large vectors as a coherent “event”, we can represent them by the coefficients of wavelets, built from individual observations. The only requirement is that the observations consist of a sufficiently large number of variables, typically n>500.

Compression is particularly useful if the properties, i.e. the observed variables, do not bear much semantic value in themselves, as is the case in image analysis, in the analysis of raw sensory data, or even in the modeling of textual information.

Conclusion

In this small essay we described five ways to reduce large sets of variables, or “assignates” as they are called more appropriately. Since for pragmatic reasons a petitio principii can’t be avoided in attempting such a reduction, mainly due to the inevitable fact that we need a method for it, the reduction should be organized as a process that decreases the uncertainty in assigning a selection probability to the variables.

Regardless of the kind of mechanism used to associate observations into groups, thereby forming the prototypes, a separation of transformation and association is mandatory for such a recurrent organization to be possible.

Notes

1. Quine [1]

2. see: the abstract model, modeling and category theory, technical aspects of modeling, transforming data;

3. The “Laplacean Demon” refers to Laplace’s belief that if all parts of the universe could be measured, the future development of the universe could be calculated. As such it is the paradigmatic label for determinism. Today we know that even IF we could measure everything in the universe with arbitrary precision (which we cannot, of course), we still could NOT pre-calculate the further development of the universe. The universe does not develop, it performs an open evolution.

4. René Thom [2] was the first to explicate the mathematical theory of folds in parameter space, which was dubbed “catastrophe theory” in order to reflect the subject’s experience when moving around in folded parameter spaces.

5. Alan Turing not only laid the foundations of deterministic machines for performing calculations; he was also the first to derive the formal structure of self-organization [3]. Based on these formal insights we can design the degree of creativity of a system.

impossibility to know for sure is the first and basic reason for culture.

6. Note that determining similarity also requires a priori decisions about methods and scales, which need to be confirmed. In other words, we always have to start with a belief.

7. Modeling without a purpose can’t be considered to be modeling at all. Performing a clusterization by means of some algorithm does not create a model until we use it, e.g. in order to get some impression. Yet, as soon as we indeed take a look following some goal, we imply a purpose. Unfortunately, in this case we would be enslaved by the hidden parameters built into the method. Things like unsupervised modeling, or “just clustering”, always imply hidden targets and implicit optimization criteria, determined by the method itself. Hence, such things can’t be regarded as a reasonable move in data analysis.

8. This sheds an interesting light on the issue of “representation”, which we cannot pursue here.

References

[1] W. V. O. Quine. Two Dogmas of Empiricism.
[2] René Thom. Catastrophe Theory.
[3] Alan Turing (1952). The Chemical Basis of Morphogenesis.
