Context

November 19, 2011

Without context, there is nothing.

Without context, everything would be a singularized item without relations. There would be no facts or events, no information or understanding. The context provides the very basic embedding for events, the background for figures, and also hidden influences on the visible. Context could be the content-side of the inevitable mediality of being. Thus, context appears as an ontological term.

Yet, context is as little an ontological concept as any other concept. It is a matter of beliefs, cognitive capacity and convention where one draws the border between figure and ground. Or even a manifold of borders. There is no necessity in setting a particular border, even if we admit that natural objects may form material compartments without any “cognitive” activity. Additionally, a context not only has no borders at all, much like the borderless sets in topology; context is also a deeply probabilistic concept. In an important sense, contexts can be defined as positively definite entities only to some extent. The constraint as a way to express the context ex negativo is an important part of the concept of context. Yet, even the constraints have to be conceived as probabilistic actualizations, as their particular actualization could depend on the “local” history or situation.

After all, the concept of context shares a lot with texts and writing, or, even more appropriately, with stories and narrating. As part of a text, the context becomes subject to the same issues as the text itself. We may find grammaticality, the implied issue of acting as in speech act theory, style and rhetoric, and a runaway interpretive vortex, as in Borges, or any poem. We have to consider this when we choose the tools for modeling and comparing texts.

The neighborhood of texts and contexts points to the important issue of the series, and hence of time and memory. Practically speaking, in order to possibly serve as part of a context, the synchronicity of signs (not: signals!) has to be established. The degree of mutual influence as well as the salience of signs is neither defined nor even definable a priori. It is the interpretation itself (understood as a streaming process) that eventually forms groups of signs, figures and background by similarity considerations. Before the actual interpretation, but still from the perspective of the interpreting entity, a context is defined only in probabilistic terms. Within the process of an interpretation, now taking the position inside that process itself, the separation of signals into different signs, as well as the separation of signs into different groups, figures or background, necessarily needs other “signs” as operable and labeled compounds of rules and criteria. Such “compound” entities are simply (abstract) models, brought in as types.

This result is quite important. In the definition of the concept of context, it allows us to refer to signs without committing the symbolic fallacy, provided the signs are grounded as operable models outside of the code of the software itself. Fortunately, self-organizing maps (SOM) are able to provide exactly this required quality.

The result also provides hints to issues in a quite different area: the understanding of images. It seems that images cannot be “understood” without the use of signs, where those signs have been acquired outside of the actual process of interpreting the pixel information of an image (of course, that interpretation is not limited to descriptions on the level of pixels, despite the fact that any image understanding has to start there).

In the context of our interests here, focusing on machine-based epistemology, the concept of context is important with regard to several aspects. Most generally speaking, any interpretation of data requires a context. Of course, we should neither try to exactly determine the way of dealing with context, nor even to define the criteria that would define a particular context. In doing so, we would commit the symbolic fallacy. Any so-called ontology in computer science is a direct consequence of getting victimized by this fallacy.

Formalizing the concept of context does not (and cannot) make any proposals about how a context has been formed or established. The formalization of context is a derived, symbolic, hence compressed view of the results of context formation. Since such a description of a context can itself be exported, the context exerts normative power. This normative power can be used, for example, to introduce a signal horizon in a population of self-organizing maps (SOMs): not every SOM instance can receive a message from every other instance if contexts are used to organize the messaging between SOM instances. From a slightly shifted perspective we could also say that contexts provide the possibility to define rules that organize affectability.

In order to use that possibility without committing the symbolic fallacy, we need a formalization on an abstract level. Whatever framework we use to organize single items—we may choose from set theory, topology or category theory—we also have to refer to probability theory.

A Small Example

Before we start to introduce the formalization of context, we would like to provide a small example.

Sometimes, and actually more often than not, a context is considered to embed something. Let us call this item z. The embedding of z together with z then constitutes a context 𝒵, of which z is a part. Let us call the embedding E; then we could write:

𝒵 = {z, E}

Intuitively, however, we won’t allow just any embedding. There might be another item p, or more generally p out of a set P, that prohibits considering {z, E} as 𝒵.

So we get

𝒵 ≠ {z, E, P}

or, equivalently,

𝒵 = {z, E, ¬P}

Again intuitively, we could think of items that would not prohibit the establishment of a context as a certain embedding, but if there are too many of them, we would stop considering the embedding as a particular context. Similarly, we can operationalize the figure-ground phenomenon by restricting the length of the embedding that still would be considered as 𝒵. Other constraints could come as mandatory or probabilistic rules addressing the order of the items. Finally, we could consider a certain arrangement of items as a context even without a certain mandatory element z.
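To make these intuitions tangible before the formalization, here is a minimal sketch in Python; all names (z, prohibited, max_len) are merely illustrative stand-ins for the items above, not a fixed design.

```python
# A minimal sketch of the intuitive context test described above.
# z: the mandatory embedded item, prohibited: the set P,
# max_len: the figure-ground length restriction.

def is_context(observation, z, prohibited=frozenset(), max_len=None):
    """Accept `observation` (a set of items) as a context Z around z."""
    if z not in observation:
        return False                    # mandatory item missing
    if observation & prohibited:
        return False                    # some p out of P prohibits Z
    if max_len is not None and len(observation) > max_len:
        return False                    # embedding too large: ground drowns figure
    return True

obs = {"z", "e1", "e2"}
print(is_context(obs, "z", prohibited={"p1"}))            # True
print(is_context(obs | {"p1"}, "z", prohibited={"p1"}))   # False
```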

These intuitions can now be generalized and written down in a more formal way, e.g. to guide an implementation, or, as we will see below, to compare them to other important formal ideas.

Components by Intuition

A context consists of four different kinds of sets, the threshold values associated with them, and order relations between pairs of items of those sets. Not all of the components need to be present at the same time, of course. As we have seen, we even may drop the requirement of a particular mandatory item.

The sets are

  • mandatory items
  • normal items
  • facultative items
  • stopping items

Context, formalized

In the formal definition we do not follow the distinction of different sets as guided by intuition. A proper generalization moves the variability into mappings, i.e. functions. We then need two different kinds of mappings. The first one is the actualization function, which reflects the relation between the presence of an item and the assignment of a context. In some respect, we could also call it a “completeness function.” The second mapping describes order relations.

Thus, we propose to start with three elements for a definition of the generalized context. On the topmost level we may say that a context is a collection of items, accompanied by two functions that establish the context by a combination of implying a certain order and demanding a particular completeness.

So, starting with the top level, we introduce the context 𝒞 as the 3-tuple

  • 𝒞 = { Ci, A, R }

where Ci is the collection of items, A denotes the actualization function, and R is a function that establishes certain relations between the items c of Ci. The items need not be items in the sense of set theory. If a more general scope needs to be addressed, items could also be conceived as generic items, e.g. representing categories.
𝒞 itself may be used as a simple acceptance mapping

  • 𝒞: F ↦ {0, 1}

or as a scalar

  • 𝒞: F ↦ { x | 0 ≤ x ≤ 1 }

In the second form we may use our context as the basis for a similarity measure!

The items c of the collection Ci have a weight property. The weight of an item is simply a degree of expectability. We call it w.

The actualization (completeness) function A describes the effect of three operations that could be applied to the collection Ci. All of those operations can be represented by thresholds.

Items c could be

  • (i) removed,
  • (ii) non-C-items could be inserted into (or appear in) a particular observation,
  • (iii) C-items could be repeated, affecting the actual size of an observation.
  • A(1): The first case is a deterioration of the “content” of the context. This operation is modulated by the weight w of the items c. We may express this aspect as a degree of internal completeness over the collection Ci. We call it pi.
  • A(2): The second case represents a “thinning” or dilution. This affects the density of the occurrence of the items c within a given observation. We call it px.
  • A(3): The third operation of repeating items c of Ci affects the size of the observation. A context is a context only if there is something other than the context. Rather trivially, if the background (by definition the context) becomes the figure (by definition not the context), it is not a context any more. We may denote this simply by the symbol l. l could be given as a maximum length, or as a ratio invoking the size of C.
  • A(4): The contrast function K, describing the differential aspect of the item sets (of the same type) between two patterns, defined as
    𝒦(a,b) = F( A ∩ B, α(A−B), β(B−A) ),  α, β ≥ 0,
    with the possible instantiation as a ratio model (see the sketch following this list)
    K(a,b) = f(A ∩ B) / [ f(A ∩ B) + αf(A−B) + βf(B−A) ]
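As a hedged illustration of A(4), the following sketch instantiates the ratio model with f simply taken as the set size; α and β are free parameters, and all names are ours, not part of any fixed library.

```python
# Ratio model of the contrast function K (Tversky-style), with f = set size.
# alpha and beta weight the two difference terms; alpha != beta
# makes K asymmetric: K(a, b) != K(b, a).

def ratio_contrast(A, B, alpha=1.0, beta=1.0, f=len):
    common = f(A & B)
    denom = common + alpha * f(A - B) + beta * f(B - A)
    return common / denom if denom else 0.0

A, B = {"x", "y", "z"}, {"y", "z", "u", "v"}
print(ratio_contrast(A, B, alpha=2.0, beta=0.5))  # ~0.40
print(ratio_contrast(B, A, alpha=2.0, beta=0.5))  # ~0.31
```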

The last aspect of a context we have to consider is the relation R between items c. These relations are described by two functions: the neighborhood function S and the dependency function D.

  • R(1): The set of all neighborhood functions S upon items c results in a partial and probabilistic serial order. One might think for instance of a context with items (v,w,x,y), where S determines a partial order such that the context gets established only if v follows x.
  • R(2): The dependency function D(ck) imposes a constraint on pi, since it demands the actual co-occurrence of the items ck given as its arguments.

Any formalism to express the serial order of symbolic items is allowed here, whether it is an explicit formalism like a grammar or a finite state automaton, or an implicit formalism like a probabilistic associative structure (ANN or SOM) accepting only particular patterns. Imposing a serial order also means introducing asymmetries regarding the elements.

So we summarize our definition of the concept of context:

  • 𝒞 = { Ci, A, R } eq.1

where the individual terms unfold to:

  • Ci = { c (w) } eq.2, “sets, items & weights”
  • A = f( pi, px, l, K) eq.3, “actualization”
  • R = S ∩ D  eq.4, “relations”
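To indicate how eq.1 to eq.4 could be turned into running code, the following is a minimal sketch in Python, under simplifying assumptions of our own: pi and px are combined by a plain product, the length threshold, the order constraints S and the dependencies D act as hard filters, and all probabilistic variants are omitted. All names are illustrative, not a fixed API.

```python
# One possible operationalization of the context C = {Ci, A, R}.
# items   : dict c -> weight w (degree of expectability), i.e. Ci = { c(w) }
# order   : pairs (a, b) meaning a must occur before b          -- S
# depends : groups of items that must co-occur if any occurs    -- D
# max_len : the length threshold l of A(3)

def evaluate_context(observation, items, order=(), depends=(), max_len=None):
    """Return a scalar in [0, 1] expressing how well the ordered
    sequence `observation` actualizes the context."""
    present = [x for x in observation if x in items]

    # A(1): internal completeness pi, modulated by the weights w
    total = sum(items.values())
    pi = sum(items[x] for x in set(present)) / total if total else 0.0

    # A(2): dilution px -- density of C-items within the observation
    px = len(present) / len(observation) if observation else 0.0

    # A(3): the background must not turn into the figure -- length l
    if max_len is not None and len(observation) > max_len:
        return 0.0

    pos = {x: i for i, x in enumerate(observation)}

    # R(1): order constraints S, here as hard (non-probabilistic) filters
    for a, b in order:
        if a in pos and b in pos and pos[a] >= pos[b]:
            return 0.0

    # R(2): dependency D -- if one item of a group occurs, all must
    for group in depends:
        hits = sum(1 for x in group if x in pos)
        if 0 < hits < len(group):
            return 0.0

    return pi * px  # one simple choice of combining the partial scores

# usage: a scalar that could serve as the basis for a similarity measure
C = {"v": 1.0, "w": 0.5, "x": 1.0, "y": 0.2}
print(evaluate_context(["x", "q", "v", "y"], C, order=[("x", "v")]))  # ~0.61
```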

This formal definition of the concept of context is situated on a very general level. Most importantly, we can use it to represent contextual structures without defining the content or the actualization of a particular instance of the concept at implementation time. Decisions about passing or accepting messages have been lifted to the operable, hence symbolic level. In terms of software architecture we can say, much as in the case of the SOM, that conditions are turned into data. In category theory we meet a similar shift of perspective, as the representability of a transformation (depictable by the “categorial triangle”) is turned into a symbol.

The items forming a context need not be measurable on the elementary symbolic level, i.e. the items need not form an alphabet 𝒜. We could think of pixels in image processing, for instance, or more generally, of any object that could be compared along a simple qualitative dimension (which could be the result of a binary mapping, of course). Yet, in the end a fixation of the measurement of the respective entity has to result in at least one alphabet, even if the items are abstract entities like categories in the mathematical sense. In turn, whenever one invokes the concept of context, this also implies some arbitrary mode of discretization of the measured “numerical” signal. Without letters, i.e. quasi-material symbols, there is no context. Without context, we would not need “letters”.

In the scientific literature, especially that about thesauri, you may find similar attempts to formalize the notion of context. We have been inspired by those, of course. Yet, here we introduced it for a different purpose… and in a different context. Given the simple formalization above, we can now implement it.

Random Contexts, Random Graphs

There is a particular class of contexts we would like to mention here briefly, because it is essential for a certain class of self-organizing maps that has been employed in the so-called WebSom project. This class of SOMs could be described as two-layered abstracting SOMs. For brevity, let us call them 2A-SOM here.

2A-SOMs are used for the classification of texts with considerable success. The basic idea is to conceive of texts as a semi-ordered set of probabilistic contexts. The 2A-SOM employs random contexts, which are closely related to random graphs.

A particular random context is centered around a selected word that occurs several times in a text (or a corpus of texts). The idea is quite simple. Each of the words in a text gets assigned a fingerprint vector, consisting of random values from [0..1], and typically of a minimal length of 80..100 positions. To build a random context one measures all occurrences of the targeted word. The length of the random context, say L(rc), is set as an odd number, i.e. L(rc) = 2*n+1, where the targeted word is always put at the center position. “n” then describes the number of preceding/succeeding positions for this word. The random context then is simply the superposition of all fingerprint vectors in the neighborhood of the targeted word. So it should be clear that a random context describes all neighborhoods of a text (or a part of it) in a single set of values.
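The construction just described is easy to state in code. The following sketch, with invented names and a toy text, builds the random context of a target word by superposing the random fingerprint vectors of its neighborhoods:

```python
import numpy as np

# Sketch of a random context: every word gets a fixed random fingerprint
# vector; the random context of a target word is the superposition of the
# fingerprints occurring within +/- n positions of each of its occurrences.

rng = np.random.default_rng(0)
DIM = 90                                   # typical fingerprint length 80..100

def fingerprints(vocabulary):
    return {word: rng.random(DIM) for word in vocabulary}

def random_context(tokens, target, n, fp):
    ctx = np.zeros(DIM)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - n), min(len(tokens), i + n + 1)
        for j in range(lo, hi):
            if j != i:                     # superpose the neighborhood only
                ctx += fp[tokens[j]]
    return ctx

text = "the map learns the context of the word map".split()
fp = fingerprints(set(text))
print(random_context(text, "map", n=2, fp=fp)[:5])
```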

With respect to our general notion of context there are some obvious differences to the random context as used in 2A-SOM:

  • constant length;
  • assumption of zero knowledge: no excluding items and no order relations can be represented.

An intermediate position between the two concepts would introduce a separate weighting function W(0,1) ↦ {0,1}, which could be used to change the contribution of a particular context to the random context.

The concept of context as defined here is a powerful structure that provides even the possibility of a translation into probabilistic phrase structure grammar, or equivalently, into a Hidden-Markov-Model (HMM).

Similarity and Feature Vectors

Generalized feature vectors are an important concept in predictive modeling, especially for the task of calculating a scalar that represents a particular similarity measure. Generalized feature vectors comprise both (1) the standard vector, which basically is a row extracted from a table containing observational data about cases (observations), and (2) the feature set, that may differ between observations. Here, we are interested in this second aspect.

Usually, the difference of the set of features taken from two different observations is evaluated under the assumption that all features are equally important. It is obvious that this is not appropriate in many cases. One possibility to replace the naive approach that treats all items in the same way is the concept of context as developed here. Instead of simple sets without structure it is possible to use weights and order relations, both as dynamic parameters that may be adjusted during modeling. In effect, the operationalization of similarity can be changed while searching for the set of appropriate models.

Concerning the notion of similarity, our concept of context shares important ideas with the concept proposed by Tversky [1], for instance the notion of asymmetry. Tversky’s approach is, however, much more limited than ours.
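A small sketch may illustrate this replacement of the naive approach: the weights dict stands in for the item weights w from our context definition, the feature names and weights are invented for illustration, and α ≠ β preserves the asymmetry just mentioned.

```python
# Weighted, asymmetric feature-set similarity: the naive assumption of
# equally important features is replaced by the item weights w, while
# alpha != beta keeps the Tversky-style asymmetry.

def weighted_similarity(F1, F2, weights, alpha=1.0, beta=1.0):
    f = lambda S: sum(weights.get(x, 1.0) for x in S)
    common = f(F1 & F2)
    denom = common + alpha * f(F1 - F2) + beta * f(F2 - F1)
    return common / denom if denom else 0.0

w = {"shape": 2.0, "color": 0.3}    # salience: adjustable during modeling
a = {"shape", "color", "texture"}
b = {"shape", "size"}
print(weighted_similarity(a, b, w, beta=0.5))  # ~0.53
print(weighted_similarity(b, a, w, beta=0.5))  # ~0.55
```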

Modeling and Implementation

Random contexts as well as structured probabilistic contexts as defined above provide a quite suitable tool for the probabilization of the input for a learning SOM. We have already reasoned in the chapter about representation that such probabilization is not only mandatory, it is inevitable: words can’t be presented (to the brain, the mind or a SOM) as singularized “words”: they need context, the more the better, as philosophical theories about meaning or about media suggest. The notion of context (in the way defined above) is also a practicable means to overcome the positivistic separation of syntax, semantics and pragmatics, as it has been introduced by Morris [2]. Robert Brandom, in the inferentialist philosophy he labels “expressive reason,” denies such a distinction, which actually is not surprising. His work starts with the primacy of interpretation, just as we do [3].

It is clear that any representation of a text (or an image) should always start as a context according to our definition. Only in this case can a natural differentiation take place, from the symmetric treatment of items to their differentiated treatment.

A coherent object that consists of many parts, such as a text or an image, can be described as a probabilistic “network” of overlapping (random) contexts. Random contexts need to be used if no further information is available. Yet, even in the case of a first mapping of a complicated structure there is more information available than “no information.” Any further development of a representation beyond the zero-knowledge approach will lead to the context as we have defined it above.

Generalized contexts may well serve as a feasible candidate for unifying different approaches of probabilistic representation (random graphs/contexts) as well as operationalizations of similarity measures. Tversky’s feature-set-based similarity function(al) as well as feature-vector-based measures are just particular instances of our context. In other words, probabilistic representation, similarity and context can be handled using the same formal representation, the difference being just one of perspective (and algorithmic embedding). This is a significant result not only for the practice of machine-based epistemology, but also for philosophical questions around vagueness, similarity and modeling.

This article was first published 19/11/2011; the last revision is from 30/12/2011.

  • [1] Amos Tversky (1977), Features of Similarity. Psychological Review, Vol. 84, No. 4, pp. 327–352. Available online.
  • [2] Charles W. Morris (1938), Foundations of the Theory of Signs. University of Chicago Press.
  • [3] Robert Brandom, Making It Explicit. Harvard University Press, 1994, chp. 8.6.2.

۞

Representation

October 24, 2011

Representation always has been some kind of magic.

Something could have been there—including all its associated power—without being physically there. Magic, indeed, and involving much more than that.

Literally—if we take the early Latin roots as a measure—it means to present something again, to place something again, or in an emphasized style, before something else or somebody, usually by means of a placeholder, the so-called representative. Not surprisingly, it is closely related to the simulacrum, which stands for “likeness, image, form, representation, portrait.”

Bringing the notion of the simulacrum onto the table is dangerous, since it refers not only to one of the oldest philosophical debates, but also to a central one: What do we see by looking at the world? How can it be that we trust the images produced by our senses, imaginations, apprehensions? Consider only Plato’s famous answer, which we will not even cite here due to its distracting characteristics, and you can feel the philosophical vortices, if not twisters, caused by the philosophical image theory.

It is impossible to deal here with the issues raised by the concepts of representation and simulacrum in any more general sense; we have to focus on our main subject, the possibility of machine-based epistemology and its conditions.

The idea behind machine-based epistemology is to provide a framework for talking about the power of (abstract and concrete) machines to know, and to know about the conditions of that (see the respective chapter for more details). Though by “machine” we do not understand a living being here, at least not a priori, it is something produced. Let us call the producer, in a simplified manner, a “programmer.” In stark contrast to that, the morphological principles of living organisms are the result of a really long and contingent history of an unimaginable 3.6 billion years. Many properties, as well as their generalizations, are historical necessities, and all properties of all living beings constitute a miraculous co-evolutionary fabric of dynamic relations. In the case of the machine, there are only few historical necessities, for the good and the bad. The programmer has to define the necessities, the modality of the senses, the chain of classifications, the kind of materiality, and so on. Among all these decisions there is one class that is predominantly important:

How to represent external entities?

Quite naturally, as “engineers” of cognitive machines we cannot really evade the old debate about what is in our brains and minds, and what’s going on there while we are thinking, or even just recognizing a triangle as a triangle. Our programmer could take a practical stance to this question and reformulate it as: How could she or he achieve that the program recognizes any triangle?

It needs to be able to distinguish a triangle from any other figure, even if the program has never been confronted with an “ideal” template or prototype. It also needs to identify quite incorrect triangles, e.g. from hand drawings, as triangles. It even should be able to identify virtual figures, which exist only in their negativity, like the Kanizsa triangle. For years, computer scientists proposed logical propositions and shape grammars as a solution—and failed completely. Today, machine learning in all its facets is popular, of course. This choice alone, however, is not yet the solution.

The new questions then have been (and still are): What to present to the learning procedure? How to organize the learning procedures?

Here we have to take care of a threatening misunderstanding, actually of two misunderstandings, heading from opposite directions toward the concept of “data.” Data are of course not “just there.” One needs a measurement device, which in turn is based on a theory, and then on a particular way to derive models and devices from that theory. In other words, data are dependent on culture. So far, we agree with Putnam about that. Nevertheless, given the body of a cognitive entity, that entity, whether human, animal or machine, finds itself “gestellt” (placed) into a particular actuality of measurement in any single situation. The theory about the data is a priori, yet within the particular situation the entity finds “raw data.” Both theory and data impose severe constraints on what can be perceived by or even known to the cognitive entity. Given the data, the cognitive entity will try to construct diagnostic/predictive models, including schemes of interpretation, theories, etc. The important question then concerns the relationship between a priori conditions regarding the cognitive entity and the possibly derived knowledge.

On the other hand, we can defend ourselves against the second misunderstanding. Data may be conceived as (situational) “givens,” as the Latin root of the word suggests. Yet, this givenness is not absolute. Somewhat more appropriately, we may conceive of data as intermediate results of transformations. This renders any given method into some kind of abstract measurement device. The label “data” we usually use just for those bits whose conditions of generation we cannot influence.

Consider for instance a text. For the computer, a text is just a non-random series of graphemes. We as humans can identify a grammar in human languages. For many years, if not decades, people thought that computers would understand language as soon as grammar had been implemented. The research by Chomsky [1], Jackendoff [2] and Pinker [3], among others, is widely recognized today, resulting in the concepts of phrase structure grammar, x-bar syntax or head-driven syntax. Yet, large research projects with hundreds of researchers (e.g. “Verbmobil”) not only did not reach their self-chosen goals, they failed completely on the path to implement the understanding of language. Even today, for most languages there is no useful parser available; the best parser for the German language achieves around 85-89% accuracy, which is disastrous for real applications.

Another approach is to bring in probabilistic theories. Particularly n-grams and Markov models have been favored. While the first is an incredibly stupid idea for the representation of a text, Markov models are more successful. It can be shown that they are closely related to Bayesian belief networks and thus also to artificial neural networks, though the latter employ completely different mechanisms as compared to Markov models. Yet, from the very mechanism and the representation that is created as/by the Markov model, it is more than obvious that there is no such thing as language understanding in it.

Quite obviously, language as text cannot be represented as a grammar plus a dictionary of words. Doing so, one would be struck by the “representational fallacy,” which has not only been criticized by Dreyfus recently [4]; it is a matter of fact that representationalist approaches in machine learning have failed completely. Representational cognitivism claims that we have distinct image-like engrams in our brain when we are experiencing what we call thinking. They should have read Wittgenstein first (e.g. On Certainty) before starting expensive research programs. That experience of one’s own basic mental affairs is as little directly accessible as any other thing we think or talk of. A major summary of many objections against the representationalist stance in theories about the mind, as well as a substantial contribution of its own, is Rosenfield’s “The Invention of Memory” [6]. Rosenfield argues strongly against the concept of “memory as storage,” in the same vein as Edelman, to which we fully agree.

It does not help much either to resort to “simple” mathematical or statistical models, i.e. models effectively based on an analytical function, as opposed to models based on a complex system. Conceiving language as a mere “random process” of whatsoever kind simply does not work, be it those silly n-grams or sophisticated Hidden Markov Models. There are open source packages on the web you can use to try it yourself.

But what then “is” a text, how does a text unfold its effects? Which aspects should be presented to the learning procedure, the “pattern detection engine,” such that the regularities could be appropriately extracted and a re-presentation could be built? Taking semiotics into account, we may add links between words. Yet, this involves semantics. Peter Janich has argued convincingly that the separation of syntax and semantics should be conceived of as just another positivist/cyberneticist myth [5]. And on which “level” should links be regarded as significant signals? If there are such links, any text immediately renders into a high-dimensional, non-trivial and above all dynamic network…

An interesting idea has been proposed by the research group around Teuvo Kohonen. They invented a procedure they call the WebSom [7]. You can find material on the web about it; we will also discuss it in great detail within our sections devoted to the SOM. There are two key elements of this approach:

  • (1) It is a procedure which inherently abstracts from the text.
  • (2) The text is not conceived—and (re-)presented—as “words”, i.e. distinct lexicographical primitives; instead, words are mapped into the learning procedure as a weighted probabilistic function of their neighborhood (sketched below).
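To indicate how these two elements could fit together in code, here is a deliberately tiny sketch: a 1-D SOM trained first on word neighborhood encodings, then on per-text histograms over the first map’s nodes. It is a structural sketch with random stand-in data and invented names, not the actual WebSom implementation.

```python
import numpy as np

# Layer 1 organizes words by their probabilistic neighborhood encodings;
# layer 2 organizes texts by their histograms over layer-1 nodes.
# A minimal 1-D SOM stands in for the real thing.

def train_som(data, nodes=16, epochs=40, lr=0.5, seed=1):
    rng = np.random.default_rng(seed)
    W = rng.random((nodes, data.shape[1]))
    for t in range(epochs):
        for x in rng.permutation(data):
            b = int(np.argmin(((W - x) ** 2).sum(axis=1)))  # best-matching unit
            radius = max(1, nodes // (2 + t))               # shrinking neighborhood
            for j in range(max(0, b - radius), min(nodes, b + radius + 1)):
                W[j] += lr * (1 - t / epochs) * (x - W[j])
    return W

def bmu(W, x):
    return int(np.argmin(((W - x) ** 2).sum(axis=1)))

# layer 1: word encodings (e.g. the random contexts discussed earlier)
word_vecs = np.random.default_rng(2).random((200, 90))      # stand-in data
W1 = train_som(word_vecs, nodes=16)

# layer 2: a text becomes a probabilistic profile over layer-1 nodes
def text_profile(vectors, W1, nodes=16):
    h = np.zeros(nodes)
    for v in vectors:
        h[bmu(W1, v)] += 1
    return h / h.sum()

docs = np.array([text_profile(word_vecs[i:i + 20], W1) for i in range(0, 200, 20)])
W2 = train_som(docs, nodes=8)                               # the document map
```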

Particularly seminal is the second of the key properties, the probabilization into overlapping neighborhoods. While we usually think that words are crisp entities arranged into a structured series, where the structure follows a grammar, or is identical with it, this is not necessarily appropriate, not even for our own brain. The “atom” of human language is most likely not the word. Until today, most (if not all) people engaged in computational linguistics think that the word, or some very close abstraction of it, plus some accidentia, forms the basic entities, the indivisibles of language.

We propose that this attitude is utterly infected by some sort of pre-Socratic and romantic cosmology, geometry and cybernetics. We cannot even know which representation is the “best,” or even an appropriate one. Even worse, the appropriateness of the presentation of raw data to the learning procedure via various pre-processors and preparations of raw data (series of words) is not independent from the learning procedure. We see that the problems with presentation and representation reach far into the field of modeling.

Although we can’t know in principle how to perform measurements in the most appropriate manner, as a matter of fact we will perform some form of measurement. Yet, this initial “raw data” does not “represent” anything, not even the entity being subject to the measurement. Only a predictive model derived from those observations can represent an entity, and it does so only in a given context largely determined by some purpose.

Whatever such an initial and multiple presentation of an entity will look like, it is crucial, in my opinion, to use a probabilized preparation of the basic input data. Yet, the components of such preparations not only comprise the raw input data, but also the experience of the whole engine, i.e. a kind of semantic influence, acquired by learning. Further (potential) components of a particular small section of a text, say a few words, are any kind of property of the embedding text, of any extent. Not only words as lexemes, but also words as learned entities, as structural elements, then also sentences and their structural (syntactical) properties, semantic or speech-pragmatic markers, and so on, of course also including a list of properties as Putnam proposed already in 1979 in “The Meaning of ‘Meaning’” [8].

Taken together, we can state that the input to the association engine consists of probabilistic distributions over arbitrarily chosen “basic” properties. As we will see in the chapter on modeling, these properties are not to be confused with objective facts to be found in the external world. There we will also see how we can operationalize these insights into an implementation. In order to enable a machine to learn how to use words as items of a language, we should not present words in their propositional form to it. Any entity has to be measured as an entity from a random distribution and represented as a multi-dimensional probability distribution. In other words, we deny the possibility of transmitting any particular representation into the machine (or another mind as well). A particular manifold of representations has to be built up by the cognitive entity itself in direct response to the requirements of the environment, which is just to be conceived as the embedding for “situations.” In the modeling chapter we will provide arguments for the view that this linkage to requirements does not result in behavioristic associativism, the simple linkage between stimulus and response according to the framework proposed by Watson and Pavlov. Target-oriented modeling in the multi-dimensional case necessarily leads to a manifold of representations. Not only the input is appropriately described by probability distributions, but also the output of learning.

And where is the representation of the learned subject? What does it look like? This question is almost devoid of sense, since it would require separating input, output, processing, etc.; it would deny the inherent manifoldness of modeling; in short, it is a deeply reductionist question. The learning entity is able to behave, react, anticipate, and to measure; hence the whole entity itself is the representation.

The second important anatomical property of an entity able to acquire the capability of understanding texts is the inherent abstraction. Above all, we should definitely not follow the flat-world approach of the positivist ideology. Note that the programmer not only should not build a dictionary into the machine; he also should not pre-determine the kind of abstraction the engine develops. This necessarily involves internal differentiation, which is another word for growth.

  • [1] Noam Chomsky (to be completed…)
  • [2] Ray Jackendoff, X-bar Syntax: A Study of Phrase Structure. MIT Press, 1977.
  • [3] Steven Pinker, The Language Instinct. 1994.
  • [4] Hubert L Dreyfus, How Representational Cognitivism Failed and is being replaced by Body/World Coupling. p.39-74, in: Karl Leidlmair (ed.), After Cognitivism: A Reassessment of Cognitive Science and Philosophy, Springer, 2009.
  • [5] Peter Janich. 2005.
  • [6] Israel Rosenfield, The Invention of Memory: A New View of the Brain. New York, 1988.
  • [7] WebSom: Teuvo Kohonen et al. (2000), Self Organization of a Massive Document Collection. IEEE Transactions on Neural Networks, Vol. 11, No. 3.
  • [8] Hilary Putnam, The Meaning of “Meaning”. 1979.

۞
