Implications of Computational Theories of Vision

Cognition and Brain Theory, 7 (1), 1984, 1-23.

Austen Clark

L. L. Thurstone Psychometric Lab

University of North Carolina at Chapel Hill

Marr's computational theory of stereopsis is shown to imply that human vision employs a system of representation which has all the properties of a number system. Claims for an internal number system and for neural computation should be taken literally. I show how these ideas withstand various skeptical attacks, and analyze the requirements for describing neural operations as computations. Neural encoding of numerals is shown to be distinct from our ability to measure visual physiology. The constructs in Marr's theory are neither propositional nor pictorial, and provide a counterexample to many commonly held dichotomies concerning mental representation.

Keywords: stereopsis, computational theories, number systems, algorithms, visual physiology, measurement, mental representation.

False dichotomies bedevil the study of mental representation. The grandest is the dichotomy between propositional and pictorial representation--between discursive and depictive, between words and pictures. Various defining properties are given for both sides of the dichotomy. The key idea is that every psychological construct which can properly be called a representation falls on one or the other side of the divide, and has either properties of words or of pictures.

I call the word-picture dichotomy a false one because one can exhibit psychological constructs which are clearly representations yet which have neither the properties of discursive representation nor the properties of depictive representation. In this paper I shall argue that some of the representations employed in current computational theories of vision are neither propositional nor pictorial, but instead represent the way numerals in a number system do. Current computational theories of vision imply that the visual system employs a system of representation which has all of the properties of a number system. Use of a number system will be shown to be distinct from both discursive and depictive representation. The examples I use all derive from the work of David Marr (see Marr, 1982).

Before proceeding it is necessary to clarify the terms of the propositional-pictorial dichotomy. Propositional representation has the following defining characteristics. First, each instance or token of a proposition is a syntactic composite, formed by concatenating tokens of atomic symbols (for example, words). There are a finite number of different types of atoms (such as noun, verb, adjective, and so on). Rules of formation differentiate well-formed combinations from illicit ones by specifying legal orderings among syntactic types. Discursive representation also has defining semantic characteristics. Each atomic symbol is associated with an extension or reference. One can provide rules for evaluating the reference of composite expressions from the reference of atoms. Each proposition has a subject-predicate form; it contains an atom which refers (to some individual or set) and an atom which ascribes (some property or relation). The prototype is a proposition ascribing a property to an individual.

Pictorial or depictive representation differs from propositions in both its syntax and its semantics. Tokens of pictures are two dimensional arrays instead of strings of atoms arranged in a particular order. Projective geometry maps each point of the picture onto some point of the thing pictured. This gives picture tokens a different mode of semantic composition from that employed in sentences. Features of parts of the picture resemble features of the thing pictured, and relations among parts of the picture map onto relations among parts of the thing. Unlike propositions, every point of a picture conveys some sense. Each picture point refers to the point on the thing to which it is related by projective geometry, and it conveys features of that thing by resemblance. In a sentence, however, arbitrary parts (such as parts of words or single letters) may have no sense. The mode of composition of composite meaning from atomic parts differs for pictures and propositions.

Dichotomies become bedeviling when they constrain one's thinking. Sentences and pictures are such clear exemplars that one may think that all representation is like one or the other. If it is not a sentence, it is a picture. It either employs compositional semantics or projective isomorphism. It either names its referent, or resembles it; it is either discrete or continuous, and so on.

I shall argue that some of the constructs in David Marr's computational theories of vision provide counterexamples to these dichotomies, and are examples of a kind of mental representation which is neither propositional nor pictorial. Marr's accounts imply that the visual system employs representations which have all the properties of a number system. The visual system literally measures features of light and maps them into numeric codes. Those codes are numeric because they satisfy the postulates of elementary number theory, and are therefore isomorphic to rational numbers. I shall argue that a numeric mode of representation is distinct from both propositional and pictorial representation, and has some of the features of both.

This idea will be developed in stages. First I will describe some features of a computational account, and show how those features imply that the constructs satisfy the postulates of elementary number theory. Second, I need to show that the constructs in a computational theory are genuinely representational, and then that they are neither descriptive nor depictive. To do this requires answering the numerous skeptical attacks on the idea of mental representation. Skeptics challenge the claim that the nervous system forms explicit representations of visual features, and literally computes further representations from them. I show how these skeptical challenges can be met, and what is entailed in claiming that the visual nervous system employs symbolic representations of features, and literally computes further descriptions from those representations. Finally, after showing that the constructs in a computational theory are numeric, and are representations, I will argue that they have some but not all of the features of both propositional and pictorial representation. Numeric codes will thus be shown to be a distinct kind of representation.

To demonstrate the logical features of representation and computation, I shall use as an example a computational theory of human stereopsis and depth perception first proposed by Marr and Poggio (1977, 1979) and extended by Grimson (1981).

The basic problem addressed by theories of depth perception is: how do humans extract information about the spatial relations and distances of three dimensional objects from light impinging on two dimensional retinas? It is presumed that we do in fact extract information about three dimensional spatial relations: that we can see whether or not one object is further away from us than another, and judge relative distances of different objects. It is also presumed that the initial sensory input arrives at retinal surfaces which can be described as two dimensional. A computational theory is one which conceives of the visual system as an information processing system that begins with an initial representation of the information available at the retina, and computes from that a further representation, which is then available for further computations. Within this approach, the question of depth perception will have been solved when we find a series of internal representations of visual information, whose initial member encodes features of light impinging on the retina, whose terminal member is a viewer-centered representation of distances of visible three dimensional objects, and in which, given any one of the representations within the sequence, its successor is generated by a computation (algorithm) which we can understand. Given the initial input at retinal surfaces, such a theory will describe a sequence of transformations of symbolic representations which terminates in a representation of depth relations among objects.

Marr describes three different levels of investigation for a computational theory. The first and most abstract level is an understanding of the computational problem to be solved. One specifies the computational problem by describing the initial information available, the terminal information extracted, and the characteristics of the transformation mapping one to the other. In the case of stereopsis, the computational problem is to derive a representation of three dimensional depth from two dimensional receptor surfaces. The second level of description is to define a sequence of representations and algorithms which could carry out the given transformation. Many different algorithms could perform a given transformation. The theorist must define the representations and provide a sequence of unambiguous operations which, if carried out long enough, are logically bound to give the answer in finite time. Finally, the third and least abstract level of description is that of the actual implementation of a given series of algorithms. One describes how the hardware is put together and how the representations are implemented within it.

Two features of the computational approach are critical to implicating numbers. The first number-implicating feature is that a computational theory explicitly ascribes representational character to its constructs. The constructs are described as discrete symbolic representations of visual information. As Marr (1976) says,

Perhaps the most novel aspect of these ideas is the notion that the primal sketch exists as a distinct and circumscribed symbolic entity, computed autonomously from the image, and operated on repeatedly by a number of local geometrical processes (1976, p. 516).

Such representations are given various properties and characteristics, are produced by computations on inputs, and are the objects of later transformations. The second number-implicating feature of the computational approach is that explicit claims are entered for the algorithms used in human vision. The visual nervous system literally computes later representational tokens from earlier ones in accord with the proposed algorithm. As will be seen below, skeptics have challenged both of the critical number implicating features of computational accounts: both the claims for representation and for computation. Much of what follows will be concerned with clarifying the meaning of these key ideas and with rebutting the skeptical attacks on them.

The goal of the computational theory of stereopsis is to derive an
explicit representation of the distances between observer and objects.
While humans use many different cues to judge depth (such as visual
perspective and texture gradients), the theory focuses on just one
such cue, namely binocular disparity. When the eyes are focused on a
given point, objects not on the same depth plane as that point project
to non-corresponding places on the two retinas. By measuring the
difference in locations (disparity) of the images, distances of
objects can be derived by straightforward trigonometric relationships.
If a given image projects to places at distances *d* and *d*'
from the center of the eye, the focal length of the eye is *f*,
and the width between the eyes is *w*, then the distance of the
object projecting those images is proportional to (*w* * *f*)/(*d*
- *d*'). The term (*d* - *d*') is the disparity--the
difference in retinal locations--and to derive depth, terms for the
focal length of the eye and the distance between eyeballs must also be
taken into account. Of course focal length and inter-eyeball distance
are constants, so if disparity can be found, the distance of the
object can be as well, by dividing the disparity into the constant
term (*w* * *f*).
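The proportionality just given can be sketched in a few lines of code. This is a hedged illustration only: the function name and the sample values are mine, chosen as convenient round numbers in arbitrary units, and are not drawn from the theory.

```python
def depth_from_disparity(d, d_prime, w, f):
    """Distance of the object, proportional to (w * f) / (d - d')."""
    disparity = d - d_prime        # difference in retinal locations
    return (w * f) / disparity     # divide the disparity into the constant w * f

# Illustrative values only: inter-eye width w = 6, focal length f = 17,
# projection distances d = 10 and d' = 8 (all in the same arbitrary unit).
print(depth_from_disparity(10, 8, 6, 17))  # → 51.0
```

The division step is the point of the sketch: obtaining depth from disparity requires an operation with the structure of division over whatever codes carry the disparities.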

Such equations would be of little interest were it not for the demonstration that humans use binocular disparity cues in depth perception (for distances up to 135 meters). Furthermore, the random dot stereogram shows that depth is perceived even when retinal disparity is the only cue to depth. The inescapable inference is that the nervous system somehow derives and employs retinal disparity--the difference in locations of projection points for a matched image. My argument will be that any theory which can explain how this is done is forced to posit numeric codes, whether implicitly or explicitly. To use disparities to obtain depth, the nervous system must do something akin to division. To derive disparities, the nervous system must do something akin to subtraction (of retinal locations). One can show that the kinship is, in fact, isomorphism, and hence that the codes are numeric. If disparities are coded at all, then the codes are a number system.

The random dot stereogram poses a staggering computational problem. In order to calculate a disparity, we must find the two retinal positions for a given feature, then subtract. To find the two retinal positions, we must match a feature found in one retina with one found in the other. To find a match for a given point in one retinal image, we must search the other retinal image, since the edge images will not be in the same place. Each potential match must be evaluated for goodness of fit and the best potential match selected. Once features are matched, a disparity can be computed, and once a disparity is computed, the distance to the given object can be computed.
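The match-then-subtract step can be sketched in one dimension. This is a simplified sketch under my own assumptions: edges are given as positions along a single scan line, and "goodness of fit" is reduced to nearest position, whereas the theory scores matches over richer features.

```python
def match_and_disparity(left_edges, right_edges):
    """Match each left-image edge to its best-fitting right-image edge,
    then subtract positions to obtain a disparity for each match."""
    disparities = []
    for xl in left_edges:
        # crude goodness of fit: the nearest right-image position wins
        xr = min(right_edges, key=lambda x: abs(x - xl))
        disparities.append(xl - xr)
    return disparities

# Two edges seen by both eyes at slightly shifted positions.
print(match_and_disparity([10, 40], [8, 37]))  # → [2, 3]
```

Even this toy version exhibits the logical demands of the step: retinal positions must be ordered and compared, and disparities are obtained by subtraction over position codes.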

Matching is computationally expensive, since for each point in the left eye image one may have to search the entire right eye image. Since a retina has on the order of ten to the sixth receptors, a worst case search has ten to the twelfth possible matches (each receptor to every receptor in the other eye). The key to the stereopsis algorithm is to choose what to match so as to limit the search. The feature matched should correspond to an unambiguous feature of an object, so that the disparity gives object depth. Marr proposes that the system matches edges--changes in light intensity between adjacent regions in the image. Changes in intensity of sufficient magnitude are either reflectance edges or occluding contours, and so have an unambiguous depth. Marr proposes an explicit representation of large changes in intensity in adjacent retinal locations, as part of the representation called the 'primal sketch.' Along with location of intensity changes, the primal sketch characterizes the orientation, contrast, width, length, and terminations of each edge.

The next problem is to isolate edges in the image. The initial encoding of visual information is called a 'gray level image' and consists of an array of intensity values. Each retinal receptor defines a receptive field within a two dimensional array of such fields, and the output of each receptor is a measure of the intensity of light within that field. A 'pixel' is one entry in the array of intensity values. In the human visual system it presumably corresponds to the receptor potential in one rod or cone.

To transform gray level image to primal sketch, Marr and Hildreth (1980) propose an edge operator algorithm. To derive edges from the gray level image, one attempts to match a template for an edge with every point in the image, then locates places where the match is best. The operator is essentially a template of how gray level values should look near an edge. The degree of fit between a gray level image area and the template is a measure of edginess in that area. To get degree of fit, we convolve (in effect, correlate) the gray level image with the edge operator, as follows. For each point in the gray level image, the template is centered at that point. Then one multiplies every pixel value in the template by the pixel value at the corresponding location in the gray level image. These products are summed over all the pixels in the template, and the result is assigned to a new image at the pixel corresponding to the center point for the template. Each pixel value in the convolved image is therefore a function of many pixel values in the original image and of the template used. Each pixel value in the convolved image expresses the correlation of surrounding points in the gray level image with the edge template.
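The convolution procedure just described can be rendered directly. This is a minimal sketch in plain Python: the 3×3 template and the toy image are my own illustrative choices, and border pixels are simply left at zero.

```python
def convolve(image, template):
    """Center the template at each interior pixel, multiply every template
    value by the corresponding image value, and sum the products into the
    output pixel -- the procedure described in the text."""
    h = len(template) // 2                     # template half-width
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(h, rows - h):
        for c in range(h, cols - h):
            out[r][c] = sum(template[i][j] * image[r - h + i][c - h + j]
                            for i in range(len(template))
                            for j in range(len(template)))
    return out

# A vertical step edge and a crude horizontal-gradient template:
image = [[0, 0, 9, 9]] * 4
template = [[-1, 0, 1]] * 3
print(convolve(image, template)[1])  # large values flag the edge columns
```

Note that each output pixel is a sum of products: the step presupposes operations of addition and multiplication over whatever codes carry the pixel values.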

Marr and Hildreth propose use of a specific set of operators. All of them have a similar form, but are sensitive to edges of different widths. The profile of each template is the second derivative of a Gaussian curve, shaped somewhat like a Mexican hat. A set of these with different widths is convolved with the gray level image, and zero-crossings in the result provide the edges of the primal sketch. Once edges are located, disparities can be easily calculated. The end result is a viewer-centered representation of distance for each edge in the primal sketch.
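In one dimension the operator and the zero-crossing step look like this. This is a hedged sketch: the formula below is the standard "Mexican hat" profile (the second derivative of a Gaussian, up to sign and scale), sampled at integer points with a width parameter I chose for illustration; the full theory convolves two-dimensional operators of several widths.

```python
import math

def mexican_hat(x, sigma):
    """Second-derivative-of-Gaussian profile (up to sign and scale)."""
    s2 = sigma * sigma
    return (1 - x * x / s2) * math.exp(-x * x / (2 * s2))

def zero_crossings(values):
    """Indices where the convolved signal changes sign: candidate edges."""
    return [i for i in range(1, len(values)) if values[i - 1] * values[i] < 0]

half = 3
operator = [mexican_hat(x, 1.0) for x in range(-half, half + 1)]
signal = [0.0] * 8 + [10.0] * 8        # a step edge at position 8
conv = [sum(operator[j] * signal[i + j - half] for j in range(len(operator)))
        for i in range(half, len(signal) - half)]
edges = [i + half for i in zero_crossings(conv)]
print(edges)  # → [8]: the sole zero crossing sits at the step
```

The sign change in the convolved signal marks the intensity change, which is why locating zero-crossings suffices to locate edges.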

As I said above, Marr's claim is not just that the proposed representations and algorithms provide a solution to the computational problem of vision, but that they provide the solution employed by humans. One way in which this sort of existence claim is put to work is by implementing the system in a computer, and then comparing descriptions it generates with human psychophysics. For example, Grimson (1981) employed various types of random dot stereograms both as inputs for a machine implementation of the stereo algorithm and as stimuli for humans. He compared the program's disparity maps with the human ability to achieve stereo fusion. The key finding was that those two sets of data correlated well, providing evidence that the same algorithm was used by human and machine.

A second way of making existence claims for representations and
algorithms is to provide them with a neurophysiological
identification. One describes the neural parts performing the given
computations and the neural states comprising the given
representations. Implementation details for the various theories are
sketchy and are best viewed as speculative. For example, Marr and
Hildreth (1980) speculate that lateral geniculate *X* cells carry
the convolution of image with Gaussian operators. They suggest that
simple cells in the striate cortex have the function of locating zero
crossings. Grimson used data on bandwidth of spatial frequency
channels in human vision (Wilson and Bergen, 1979) to define pixel
width of his edge templates. Although most implementation details are
not known, and indeed study of the computational and algorithmic
issues can proceed without knowing them, the important point is that
the constructs and operations are presumed to have a neural
implementation which can ultimately be detailed. The theorists presume
that there is a neural implementation for edge finding, the primal
sketch, disparity maps, and so on: that the computational constructs
can ultimately be identified neurophysiologically.

If a computational account of vision is true, then the codes employed by the visual system constitute an algebraic system isomorphic to the rational number system. Another way to put this is that the codes satisfy the postulates of elementary number theory and provide a standard (isomorphic) model for the rational numbers. Note that the codes are not numbers (which are abstract entities and not neural states) nor do they necessarily name numbers (their interpretation is within vision and not within set theory). The point is that the codes satisfy the logical relational structure defining numbers and hence provide a model for the rationals.

What conditions must a set meet to provide a model for the
rationals? First, it must be a 'well ordered integral domain.' This is
a characterization of numbers deriving from abstract algebra (see
Stoll, 1963, Chap. 8). An integral domain is a 3-tuple <*N*, *R*,
0>, where *N* is a set, *R* is a binary operation on that
set, and 0 is an element of the set *N*. For the domain to be well
ordered, the relation *R* must meet certain requirements. It is a
one-to-one mapping from *N* to *N* - {0} (i.e., excluding
zero). *R* is transitive, reflexive, and antisymmetric. The
element 0 is its initial member (no member *x* of *N* bears *R*
to 0), and every nonempty subset of *N* has an initial *R* member.
Finally, a subset of *N* is identical to *N* if it includes
zero and includes the *R* image of every member it includes.

Any integral domain with this structure provides a standard model
for natural numbers. The relation *R* defines a progression,
whose intended interpretation is 'less than or equal to', and within
which the arithmetic relations of less than and equality can be
defined. We can also define a unique function within the integral
domain for addition, and a unique function for multiplication.
Addition is defined by the two characteristics:

(1) For all *x* in *N*, *x* + 0 = *x*.

(2) For all *x*, *y* in *N*, *x* + (the *R* of *y*) = the *R* of (*x* + *y*).

There is just one function meeting these two conditions.
Multiplication can be defined in an analogous way. One can then prove
all of the ordinary arithmetic properties for the operations of
addition and multiplication. That is, functions defined in that way
are associative, commutative, obey cancellation laws, and so on.
Rational numbers can be defined as an extension of natural numbers;
that is, as an integral domain meeting further conditions. First, to
be assured of a solution for *x* in all equations of the form *b*
+ *x* = *a*, we define the integers as ordered pairs <*b*,
*a*> of natural numbers, providing solutions to such
equations. For example, the equation 3 + *x* = 2 defines the
integer -1. To be assured of a solution for *x* in all equations
of the form *b* * *x* = *a*, we define the rationals as
ordered pairs of integers <*b*, *a*>. The solution of
3 * *x* = 2 is the rational 2/3.
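The recursive definitions can be made concrete. This is a sketch under my own simplification: Python's natural numbers stand in for the elements of the domain, and the relation *R* is played by a `succ` function; the point is only that addition and multiplication fall out of zero and successor alone, and that the pair constructions then extend the domain.

```python
def succ(n):
    """The relation R: each element maps to its successor."""
    return n + 1

def add(x, y):
    """Addition from conditions (1) and (2) alone."""
    if y == 0:
        return x                    # (1)  x + 0 = x
    return succ(add(x, y - 1))      # (2)  x + R(y') = R(x + y')

def mul(x, y):
    """Multiplication defined analogously: x * 0 = 0, x * R(y') = (x * y') + x."""
    if y == 0:
        return 0
    return add(mul(x, y - 1), x)

print(add(3, 4), mul(3, 4))  # → 7 12

# Extending the domain by ordered pairs: <3, 2> read as an integer is
# the x solving 3 + x = 2 (i.e., -1); read as a pair of integers it is
# the x solving 3 * x = 2 (i.e., the rational 2/3).
from fractions import Fraction
assert 3 + (-1) == 2
assert 3 * Fraction(2, 3) == 2
```

One can verify by computation that functions so defined are associative and commutative on any examples one cares to try, in line with the claim that the ordinary arithmetic properties are provable from the definitions.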

An algebraic system is just a characterization of a set and of operations and relations defined over that set. Other axiomatic characterizations of numbers employ different primitives, a less developed logical calculus, and different axioms. For example, one can begin with a calculus for first order logic. If one adds identity, second order predicates (for classes of classes), quantification over classes, and primitives for zero, successor, and ancestrals of relations, natural numbers can be defined, along with operations of addition and multiplication (Carnap, 1939, p. 41). One can take addition and multiplication as primitives, and define zero, successor, and number in terms of them (Quine, 1969, p. 109). The important point is that all of these axiomatizations have rational numbers as their intended model, and they specify relations and properties any model must have if it is to be isomorphic to numbers. As Quine (1969, p. 81) says, "Any objects will serve as numbers so long as the arithmetical operations are defined for them and the laws of arithmetic are preserved."

The codes defined by computational theories of vision are numeric
because they have arithmetic operations defined for them which
preserve the laws of arithmetic. The set of codes is a well ordered
integral domain. For example, no matter how disparities are coded
neurophysiologically, we know that those codes must meet the following
conditions. For every disparity code there is a theoretically possible
greater one. The relation "greater than or equal to" well
orders the codes. Disparities are solutions to equations of the form *b*
+ *x* = *a*, and so include the negative numbers. Depth is
derived from disparity by solving an equation of the form *b* * *x*
= *a*, so the codes are closed under division, and hence comprise
the rationals. To compute depth from disparities, there must be an
operation of multiplication defined for whatever codes disparities are
coded in. No matter how disparity is coded physiologically, because
those codes must meet such conditions, they provide a model for the
rational numbers: they are a number system.

The key to this argument is that the algorithms which are specified to link succeeding representations imply that those representations satisfy the above numeric properties. Since the theory explicitly claims that those algorithms provide the solution used by humans, the theory implies that the representations employed by the human visual system are number-like.

Consider some of the computations defined in the theory.
Convolution of the gray level image with an edge template is an
operation of multiplication of pixel values followed by addition. We
have two internal symbolic representations (the gray level image and
the primal sketch), related by a computation. If the computation is
really one of convolution, then no matter what the codes are for the
gray level image and the primal sketch, there must be operations of
addition and multiplication defined for those codes. To compare
goodness of fit across several matches, there must be a relation
defined among the codes for less than or equal to. Suppose for example
that lateral geniculate *X* cells carry the convolution of gray
level image with Gaussian edge operators. The output may be carried in
several different ways: as frequencies of spikes, rate of change of
frequencies, latencies, or microstructure codes. Whatever the neural
states comprising the codes are, the theory requires that they meet
the following conditions. Codes can be ordered as greater or lesser,
so that goodness of fit can be assessed. There exists a zero value.
Between any two codes an intermediate code is theoretically possible.
Convolving a template and finding best fits never yields a non-code
value, so the codes are closed under addition and multiplication.
There is a code corresponding to a solution *x* for each equation
of the form *b* + *x* = *a* and *b* * *x* = *a*,
so that neural codes for disparity and for depth can be derived. Hence
those neural states, no matter what they are, satisfy the relational
structure of the rational numbers and constitute a number system.

The codes must be 'numeric' in the sense of exhibiting all structural properties of numbers. The content, format, and implementation of the codes are left open. However, by specifying computations performed on the codes, the theorist presupposes that, whatever the codes are, they satisfy the logical relational structure of a number system. The computations require operations of addition and multiplication, relations of less than and equals, and a zero code. If the system must have all that, it must operate with a system of representation isomorphic to the number system. Note that this constraint is stronger than that imposed by merely specifying mathematical functions between succeeding representations. Mathematical functions are defined over numbers, while computations are only defined over representations (in this case: numerals). In defining a function it is irrelevant how numbers are represented--the names, whether in decimal or binary or octal notation, are irrelevant to whether a given relation among numbers holds. A computation of a function, however, operates on numerals, and depends critically on how numbers are represented. Since we have certain operations in the algorithm, we must have certain structural properties in the codes: namely, those defining a number system.
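The numeral/number distinction can be made vivid: the addition *function* is the same whatever the notation, but an addition *algorithm* must be defined over numeral tokens, and changes with the notation. A sketch (schoolbook ripple-carry addition over binary numerals written as strings, my own illustrative choice; the decimal analogue would need a different digit procedure):

```python
def add_binary(a, b):
    """Ripple-carry addition defined over binary numerals (digit strings).

    The procedure manipulates tokens; it yields sums only because the
    notation satisfies the structural properties of a number system."""
    a, b = a.zfill(len(b)), b.zfill(len(a))   # pad to equal length
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 2))
        carry = total // 2
    if carry:
        digits.append('1')
    return ''.join(reversed(digits))

# One function, two modes of computation: '1011' + '110' as numerals,
# 11 + 6 as numbers -- the same sum, differently computed.
print(add_binary('1011', '110'))  # → '10001'  (11 + 6 = 17)
```

The algorithm never touches a number, only digit tokens; that it computes addition at all depends on the numerals forming a system with the right relational structure.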

To clarify the idea just proposed, it will be necessary to work through a variety of objections. The fundamental objection is that the theory given does not establish that the visual system literally operates with representations of features of light or that it literally computes successive representations from earlier ones. The objections are all various ways of claiming that the idea of visual computation or visual representation is a confusion, and that the neurons involved in vision do not represent anything or compute anything. In this section I will explore the possibility that postulating a numeric code within vision is just a symptom of hypostatizing our measures. Section 4 will examine the distinction between tokens of numerals and measurements of internal states. Section 5 will consider issues of analog computation in vision. Section 6 will examine the argument that the algorithms proposed by Marr are meant merely as analogies, and that neurons do not literally 'compute.' Section 7 will explore the idea that the visual system measures features of the environment.

The first objection is as follows. The idea of visual numeric codes simply confuses the numbers and equations in a theory with number-like entities in the object the theory is about. To say that the visual system computes a convolution from the input image and an edge template is just to say that the theory contains a mathematical model including a convolution function. There are internal states within the system which we can describe with various equations and mathematical entities (like arrays and correlations), but that does not imply that there are mathematical entities lurking in the head described. To claim that the visual system literally computes a convolution is to reify or hypostatize our measurements: to mistakenly treat those measurements as real number-like entities in the head. We may describe many systems with numbers and equations, without implying that the system itself employs numeric description. For example, in psychophysics many magnitude estimates seem to obey a power law, to the effect that the psychological magnitude is proportional to the physical stimulus intensity raised to some power. Firing frequencies for some transducer neurons are also found to be proportional to a power of stimulus intensity. Neurons in the knee fire at a rate proportional to the flex angle raised to a power. But even though we find such equations, they do not imply that the knee neuron operates with numeric codes, computes a new representation using a power function, and fires appropriately. It just works that way because of its biophysics. Given that there is no temptation to ascribe a numeric representation to the knee neuron, and that the only reason to ascribe numbers is that we find measurement and equations helpful in modeling what is going on, why ascribe representation as of quantity to the visual system? What is the difference between it and the knee neuron?

The answer to this objection is that there are two differences between the knee neuron case and the computational models of vision. The first is that a computational account of vision explicitly posits a sequence of internal representational states. It claims that there are indeed number-like codes in the head. If this seems scandalous, try to account for depth from disparity without postulating codes for quantities! In the knee neuron and other psychophysics examples, by contrast, there is no claim entered for the existence of representations. There is no implication that knees represent numbers and calculate power functions.

The more important difference between the cases, however, lies in the fact that in computational theories, internal tokens are linked to further consequences within the system: they form the domain for the next set of computations. In the psychophysics examples there is no further equation linking psychological magnitudes to later representations; there is no explicit claim for further information processing for knee flex angles. For firing rates of the knee neuron to be a numeric code, they would have to be used as a code, and used numerically, so that the differences across firing rates make an arithmetic difference to later processing.

This should clarify the distinction mentioned above between mathematical functions and computations. One can employ mathematical expressions and equations to represent what is going on without committing oneself to the existence of an internal number system. However, if one proposes that the visual system employs a series of computations, one is making a very different sort of claim. One cannot compute with numbers, but only with names of numbers. To propose a computational model one must propose a set of representations upon which it operates. Marr claims explicitly that the visual system employs a sequence of computations to derive representations of spatial relations. The theory does not just employ equations to describe relations between states of the visual system, but proposes that the system itself computes following an algorithm.

The difference proposed between knee neuron and vision raises the question of what it means for a neural state to be used as a code. Suppose for instance that we can describe further consequences of knee neuron firing rates, using a system of equations describing further processing. Neural states at each level are related in a complicated way to those of the preceding level. Even with the system of equations, however, our measurements do not require us to postulate an internal system of representation of quantity. We can accept a complicated system of states, each related to others by mathematical equations, without needing to assume that the system employs numeric codes. Does use as a code imply anything different from description by a system of equations?

The same problem arises in vision. Given the neurophysiological and
topographic regularities in the visual system, to describe it requires
a system of equations relating successive topographic layers. A
computational model simply uses programs to identify the functions
linking successive states, instead of equations. The convolution
algorithm, for example, is simply a way of relating firing rates of *X*
cells to retinal receptor potentials. Rather than state the relation
in an equation, one describes it by writing a program. The only
numbers involved in this model are frequencies of cell firings--which
are numbers we ascribe after measurement. The sequence of states
satisfies a system of equations, so that retinal receptors are related
to *X* cell frequencies, which are presumably related to yet
other frequencies. Even if we accept a set of equations as describing
sequences of states in this way, is there any reason to suppose that
the system is explicitly operating with a number system?
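A minimal sketch may make the contrast concrete. The following toy convolution (the mask values are invented for illustration, and are not Marr and Hildreth's operator) relates an input array of 'receptor potentials' to an output array of 'firing rates'; an equation relating the two arrays would describe exactly the same regularity:

```python
# Illustrative sketch only: relating an input array of 'receptor
# potentials' to an output array of 'firing rates' by convolution
# with a small center-surround mask.

def convolve(signal, mask):
    """Discrete 1-D convolution, valid region only."""
    k = len(mask)
    return [sum(signal[i + j] * mask[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A toy center-surround operator: excitatory center, inhibitory surround.
center_surround = [-1, 2, -1]

# A step edge in the input yields a trough-peak pair in the output --
# the kind of regularity an equation could equally well describe.
potentials = [0, 0, 0, 10, 10, 10]
rates = convolve(potentials, center_surround)
print(rates)  # [0, -10, 10, 0]
```

The question in the text is whether anything beyond this descriptive equivalence licenses saying that the system computes.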

A computational approach insists that there is a difference between the visual system's representations and our representations of it. Claims for numeric codes and for computations performed by the visual system are at a different level of description from our use of measurements and equations to describe vision physiologically.

One way to make this point is to show a distinction between (our) measurement of internal states and (its) encoding of quantities. To measure an internal state (such as frequency of firing of striate neurons) requires a set of properties and relations which may be very different from the properties and relations presupposed when it is claimed that the internal state encodes tokens of numerals. Suppose we have a computational device which operates with tokens of numerals, so that states within it are tokens of representations of numbers. Note that a measurement functor which maps characteristics of the internal states of the device into numbers need not be the same functor as the one associating the state with a numeral. That is, the number we assign to the state when we measure it is not necessarily the number the state represents.

To see this, imagine snaking a microelectrode down into the innards of a PDP-11, and recording voltages of a register one intercepts. Voltage measurement will suffice to assign some number (probably between 0 and 5) to the register, but the register is representing some other totally different number (it uses a zero or five volt charge to code a zero or one within a binary number system). So our measurement need not tell us the number the register is representing. Internal states of the device may have many different measurable features, only a few of which are relevant to coding numerals. The feature measured and the number represented may be totally distinct.
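The point can be put in code. In this hypothetical sketch the measurement functor and the encoding functor map the very same register states onto different numbers (the voltages and the threshold are invented):

```python
# Hypothetical illustration: a 'register' as a list of voltages.  What a
# voltmeter assigns to each cell (a number near 0 or 5) is not the
# number the register represents (a binary numeral read off those cells).

register_volts = [5.1, 0.2, 0.1, 4.9]   # measured at each flip-flop

def measured_numbers(volts):
    # The measurement functor: each state is mapped to a voltage value.
    return volts

def represented_number(volts, threshold=2.5):
    # The encoding functor: each state codes a binary digit; the digits
    # together name a number in the binary system.
    bits = ['1' if v > threshold else '0' for v in volts]
    return int(''.join(bits), 2)

print(measured_numbers(register_volts))   # numbers between 0 and 5
print(represented_number(register_volts)) # 1001 binary, i.e. 9
```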

Of course it may be that the coding for numbers employed by the device employs measurable features of the internal states; that is, that the features used to code numbers satisfy a relational structure which we can map into numbers. But the stronger point can now be made that the characteristics and relations of internal states relied upon in order to code quantities may not even satisfy prerequisites for measurement.

To measure *X*, one must establish a mapping between instances
of *X* and numbers, and show the relational structure of numbers
preserves the relations over *X*. In particular, one must define
a relation *R* over instances of *X* which serves to order *X*.
Different kinds of ordering are possible; the weakest is one in which *R*
is transitive, antisymmetric, and connected (see Krantz, Luce, Suppes,
and Tversky, 1971). The relational structure found among members of *X*
is shown to be preserved when each is mapped to a number. One can then
use numbers to represent relations among *X* members. However,
without some ordering of *X* one cannot map *X* into
numbers. My claim is that *X* states can be tokens of numerals
without any such ordering over *X*.

For example, to code a binary zero or one, the device may employ relations over its internal states which are not transitive or connected. There may be one discrete salient kind of state which is a code for one, and everything else is a zero. Coding may not employ any relational ordering at all: the non-relational predicate "voltage greater than 5" may be the only one used to encode numerals. In our PDP-11 register, for instance, a freak voltage of six would not constitute a code for two--it would still just be the binary digit one. Hence the properties and relations required for the internal states of the device to code numbers are not the same as the properties and relations required for us to measure features of the states.

The further implication of this distinction is that the equations and functions we employ to describe relationships among states of the device may not describe the functions it is computing. If we inserted a second microelectrode into the PDP and recorded relationships among adjacent registers, we may well discover that the second voltages are a function of the first, and some complicated equation links the two. Clearly that function can be distinct from the function computed by the device. That is, the device takes the numeral token from the first register, runs it through various binary logic circuits with other tokens, and writes the result token into the second register. It may for instance be multiplying the number by three, so that

1001 * 0011 = 11011 (binary)

9 * 3 = 27 (decimal)

When we record voltages, we get voltage patterns with amplitudes of five volts. Each pattern can be Fourier analyzed as a sum of a series of sine waves of different wavelengths, and perhaps the output pattern described as a function of the two input series of wavelengths. Whatever function we find relating the output register pattern with the inputs, it certainly will not be one of multiplying by three. Relations between names of numbers are distinct from relations between numbers.
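The worked example can be restated in code: the relation 'times three' holds between the numbers named by the register contents, not between any measured features of the voltage patterns:

```python
# The multiplication example above: relations between names of numbers
# are distinct from relations between numbers.

def decode(bits):          # numeral token -> number named
    return int(bits, 2)

def encode(n):             # number -> numeral token
    return format(n, 'b')

in_a, in_b = '1001', '0011'            # tokens in the input registers
out = encode(decode(in_a) * decode(in_b))

print(decode(in_a), decode(in_b), decode(out))  # 9 3 27
print(out)                                      # 11011
```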

The moral of our PDP experiments is that the conditions for measuring and assigning numbers to internal states are distinct from the conditions for ascribing tokens of numerals to those states. Hence the grounds for claiming that the system operates with a system for representing numbers are distinct from our ability to measure its states and relate them with equations.

The computational assumption of distinct levels of description is thereby vindicated. There is a difference between claiming that a system employs a numeric mode of representation and claiming that a system can be described by a set of equations and measurements. To know what number the device is representing, measurement of its internal states is insufficient. To decide what number it represents or what computation it is carrying out, one must know how the internal states are mapped into a discrete set of tokens of numerals, and what the number system is (decimal, binary, octal, binary coded decimal, and so on). The whole point of ascribing a system of representation to the device is that it opens the distinction between the tokens employed by the system and what those tokens represent. Measuring the tokens will not yield their reference.

I have claimed that measurement will not necessarily give the number represented by an internal state, because the conditions for encoding are not equivalent to the conditions for measuring. However, this does not imply that encoding and measurement are always distinct. It is possible that the number represented by the internal state is the same as the number we derive by measuring the state. Perhaps the features the system employs to encode tokens are just the features we employ to map its states into numbers. This is one characteristic of an analog computer: there is some measurable feature of its internal states, and our measures of that feature can provide just the number the device represents. An analog computer is constructed so as to provide a model for the system of equations requiring solution. Its parts are arranged so that some of the variables of its operation are inter-related in exactly the way described by those equations. By careful measurement of states of the analog device, a solution for the system of equations can be found. Perhaps this is the way the visual system works, with frequencies of firing providing the analog representation. In this case is there any need to posit an internal system of representation?

In an analog computer there may be no distinction between the numbers and equations used to describe the device and the 'representations' it employs in its computations. In a digital device the relationships among the names of numbers (relationships among the codes--the voltage patterns in memory) are distinct from the relationships computed between numbers. The analog device merely instantiates equations--its states are described by them--while the digital device computes them symbolically--by manipulating encodings of numbers (Pylyshyn, 1980, pp. 129-130). To 'compute symbolically' here means to operate with encoded names of numbers in such a way that the numbers named stand in the required relation. An analog device may not have this 'symbolic' level of description. It is constructed in such a way that description of its internal states is isomorphic to description of the equations it computes. There is no difference between equations describing its states and equations relating the numbers 'represented.'

One reason for positing representation is precisely because of the difference between descriptions of relations among states of the device and descriptions of relations among the referents of those states. If, as in a pure analog device, that difference vanishes, then there is a corresponding weakening in the justification for positing numeric codes. But does evidence of analog function preclude description at the 'symbolic' level?

Analog function is an implementation detail, and does not preclude
the existence of a symbolic and algorithmic level of description as
well. For example, a parallel adder may be constructed in which
addition for each digit is performed by an analog device (counting
spikes, for instance) yet each digit is given a distinct place. Even
though all the parts work in analog fashion, and for a given digit
adder there is no distinction between measured feature and number
represented, the full *N*-digit number added is not accessible by
measurement alone, but requires place coding. Similarly, even though
parts of the nervous system can be modeled as analog, it may still
employ discrete codings. Analog function gives no grounds for denying
a symbolic and algorithmic level.
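A sketch of the hybrid device just described (all details invented for illustration): each digit position is summed 'in analog fashion' by counting spikes, but the full *N*-digit result exists only under the place coding that assigns each adder a power of ten:

```python
# Illustrative hybrid adder: per-digit 'analog' addition (mere spike
# counting), with the full number recoverable only via place coding.

def analog_digit_add(spikes_a, spikes_b):
    # For a single digit adder, measured quantity and number represented
    # coincide: the sum just is the total spike count.
    return len(spikes_a) + len(spikes_b)

def place_coded_add(digits_a, digits_b):
    # Digits are least-significant first; the place coding (position i
    # means 10**i) is not recoverable by measuring any one adder.
    carry, out = 0, []
    for a, b in zip(digits_a, digits_b):
        s = analog_digit_add(['|'] * a, ['|'] * b) + carry
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    return out

print(place_coded_add([7, 4], [5, 2]))  # 47 + 25 = 72 -> [2, 7]
```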

Analog function does not preclude operation with an algorithm;
however, operation with an algorithm implies a non-analog, symbolic
level of description. If the system operates with an algorithm, then
there is a distinction between features of its states and the
referents of those states. The reason is that an algorithm is a
sequence of instructions acting on symbols. By manipulating names of
numbers, one can compute functions over numbers. Similarly, by
manipulating codes for disparity, one can compute depth. No matter
what the neural states are which underlie disparity detection, their
representational character--their status as symbolic tokens--is
guaranteed by the fact that by manipulating those states, one can
derive three dimensional depth. There would be no need to describe
stereopsis algorithmically if there were no distinction between the
measurable neural features of disparity codes and the number they
represent (e.g., difference in retinal location). One could
simply detail the biophysics of the system. But in the case of
stereopsis, the symbolic character of disparity codes is guaranteed by
their use in depth perception. Because one arrives at an
endpoint--perception of three dimensional depth--clearly distinct from
the physiology of disparity detectors, the symbolic character of those
neural states is assured.
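The claim can be illustrated with the textbook pinhole-stereo relation between disparity and depth (a deliberate simplification, not Marr and Poggio's algorithm; the focal length and baseline values are invented):

```python
# Illustration of the point in the text: manipulating codes for
# disparity (differences in matched image positions) yields values in a
# different domain, depth.  Assumes idealized pinhole stereo geometry.

def disparity(left_pos, right_pos):
    # The 'code' for disparity: a difference of matched image positions.
    return left_pos - right_pos

def depth(disp, focal_length=1.0, baseline=0.065):
    # Depth is inversely proportional to disparity: Z = f * b / d.
    return focal_length * baseline / disp

d = disparity(0.013, 0.0)
print(depth(d))  # larger disparity -> nearer surface -> smaller depth
```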

The final point concerning analog vision is that discrete internal tokens are consistent with physiology. The issue of neural implementation should not be prejudged. Computational theorists are not committed to encoding all quantities as spike frequencies, and other coding schemes are possible (Bullock, 1973; Perkel and Bullock, 1968). For example, place coding employs different neurons for different features, and could easily represent discrete tokens. A simple binary feature such as increase or decrease in rate of firing can provide a discrete code. Stochastic codes use changes in average firing rates of groups of neurons to represent discrete events. Phase and latency shifts can encode information without any change in spike frequency. In these ways it is consistent with physiology to suppose that the system employs an encoding scheme for tokens of numerals which does not employ the same relational ordering we employ to measure its states.

The preceding arguments have placed a heavy burden on existence claims for algorithms in human visual processing. Because those algorithms employ operations of multiplication and addition and require a zero and a well ordered field among the codes, the codes must be numeric. Because an explicit claim for the use of algorithms is made, a system of representation is required which is distinct from our measurements of internal states. Because we have computations, we must have codes, and because we have codes, there is a distinction between numbers we use to represent what is going on and numerals the system employs in its computations.

Against both lines of argument the following objection may be raised. The computational accounts of vision should not be construed as theories, but as models; not literally, but as analogies. There is no sequence of instructions literally executed in the visual system. The human visual system does not literally convolve a gray level image with edge detection operators, nor does it derive disparities by literally subtracting positions of matched retinal images. Instead one can say only that it does something like convolution or analogous to subtraction. Those processes are like what goes on in the computer, but one need not say that humans literally use the same algorithm, or that the visual system literally adds and multiplies.

Unfortunately, I think this objection misconstrues the import of computational accounts. I believe such accounts imply that the nervous system does literally add, subtract, multiply, and, in general, compute using an algorithm, operating on numeric codes.

Are algorithms just analogies? One way to see that they are not is to note the vanishing difference between processes analogous to computations and processes which are computations. One has trouble imagining a system which employed an operation like multiplication (or addition, convolution, etc.) which was not multiplication (or addition, convolution, etc.). How can it be like multiplication and not be multiplication? If the operation is isomorphic to multiplication of numbers, then it is multiplication. If just the two conditions are met:

(1) an identity element is defined, so for any *x*, *x* * 1 = *x*; and

(2) a well ordering relation *R* is defined, such that for any *x* and for any *y*, *x* * the *R* of *y* = the *R* of *x* * *y*;

then we have uniquely defined multiplication; and it is hard to imagine something analogous to multiplication which does not meet those conditions. If it does not meet either of those conditions, then the operation is not even like multiplication. Conversely, if it is analogous to multiplication (in those two ways), then it is multiplication. There is no difference between analogies and existence claims in this case.
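For comparison, the familiar Peano-style recursion fixes multiplication from just an identity element and a successor operation; in its standard textbook form the recursive clause is x * succ(y) = (x * y) + x. Any operation satisfying the recursion agrees with multiplication everywhere:

```python
# Peano-style recursion pinning down multiplication from an identity
# element and a successor operation (standard textbook formulation).

def succ(n):
    return n + 1

def mult(x, y):
    if y == 1:                   # clause (1): identity element
        return x
    return mult(x, y - 1) + x    # clause (2): the successor recursion

# An operation meeting these conditions simply is multiplication:
assert all(mult(x, y) == x * y
           for x in range(1, 8) for y in range(1, 8))
print(mult(9, 3))  # 27
```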

What then is the import of claiming that neural processes are algorithmic? Consider fuel injection systems in a car. What is the distinction between carburetors which employ an algorithm to compute the amount of fuel to be injected at certain times, and those which do not? Both cause fuel to be injected appropriately, and we can suppose the fuel injection behaviors of both to be the same. The difference, if any, must lie in the intervening steps to produce the behavior. In the old non-algorithmic carburetor, transitions between states can be explained most simply by mechanical laws relating those states. For example, pressure against a valve causes a spring to deform an amount determined by the force, and thereby allows an amount of liquid determined by the pressure and the valve cross section to pass down the channel. In the new algorithmic carburetor, however, the transition functions are described differently. One partitions states of the carburetor (and circuit chip) into control states and codes, and transition functions mention both states and codes. Each step depends in part on the code present. States are codes in the sense that they have a consistent relationship with engine events such as gas pedal depression, gear shifting, axle revolutions, and temperature. Furthermore, under those interpretations the transition function can be described as computing a function yielding fuel amount from engine states such as revolutions, temperature, and so on.

To formally define an algorithm, one describes a set *Q* of
control states, a set *E* of codes, and a function which maps
ordered pairs <*Q*, *E*> into <*Q*, *E*>
(Hopcroft and Ullman, 1979). The algorithm has a decidable set of
states and a decidable set of symbolic tokens. Given an input
state--token pair, the output is a unique state--token pair.
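A toy instance of this definition (the states and tokens are invented, not from Hopcroft and Ullman): a transition function mapping <*Q*, *E*> pairs into <*Q*, *E*> pairs, with a computation as the unique sequence of successors:

```python
# A toy transition function on (control state, token) pairs.  This
# little machine rewrites '0' tokens to '1' until it reads a '1',
# then halts.

delta = {
    ('scan', '0'): ('scan', '1'),
    ('scan', '1'): ('halt', '1'),
}

def run(state, tokens):
    # A computation: the unique sequence of (state, token) successors
    # determined by the transition function for a given input.
    trace = []
    for t in tokens:
        state, t = delta[(state, t)]
        trace.append((state, t))
        if state == 'halt':
            break
    return trace

print(run('scan', ['0', '0', '1']))
# [('scan', '1'), ('scan', '1'), ('halt', '1')]
```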

The problem is that tokens of internal codes are themselves states of the system. The distinction between transition functions referring to both machine states and codes and those referring to just machine states therefore requires further justification. In effect, in a computational account we partition states of the system into two sets (codes and non-codes) while in non-computational accounts there is no such partition. On what grounds can one partition states into codes and non-codes? Is such a partition ever necessary for description?

At the physiological level, there is no necessary property of states which distinguishes those which are tokens of something from those which are not tokens of anything. Nothing intrinsic to a token marks it as a token. To justify partitioning states into codes and non-codes, one can appeal to the economy of description possible with a computational account. One could describe the state transitions of the computational carburetor without mentioning internal codes or computations, but that function would need to specify all of the different electrical states which correspond to different representations, and for each give the unique electrical state which follows. A simpler description is possible if we can specify mappings from engine states into codes, and then define algorithms operating on those codes. An encoding function and an algorithm allows a relatively simple description of state transitions.

Consider for instance the PDP multiplication example introduced above. Voltage patterns in the memory cells encode binary numerals. Once the code is known, the relationship between cell contents can be described very simply: one number named is three times the other. Note that the relationship between states of the device is described by the relationship between the referents of those states. One could describe the sequence of states of the PDP without mentioning codes or their referents; it could be described by a complicated function mapping one voltage pattern into another. Description in terms of codes and their referents is much simpler, however. Similarly, it may be possible to give a simpler description of the physiology of vision by describing states as codes for features, and describing relations among physiological states in terms of the features represented. For example, the complex physiology of ganglion and lateral geniculate cells can be described as convolving an intensity array with second order edge operators.

The key is partitioning states of the system into codes and control
states. If this partition can be made, one can describe sequences of
states of the system as steps in an algorithm. The computationally
relevant state of the system is given in an instantaneous description,
which details the current token *E*, and the current control
state *Q*. If the appropriate transition function is found, each
instantaneous description has a unique successor. The algorithm
thereby constrains potential sequences of instantaneous descriptions
to just those which satisfy a particular function. A computation upon
a given input can then be defined as the unique sequence of
instantaneous descriptions for the device given the input.

The critical premise is that one can treat some states of the system as codes--as having reference. One reason to accept that states of the system are codes is just that one may thereby arrive at a simpler description of what the system is doing. But the more important reason was mentioned above (Section 5). Manipulation of neural states associated with retinal location and disparity yields accurate information about a different domain: three dimensional depth. The only way to account for this is to accept the idea of symbolic computation: those states have representational character (are codes), and by manipulating codes the system derives new descriptions of the domain. Once this premise is accepted, one can redescribe sequences of physiological states as steps in an algorithm. Algorithmic or computational language is simply a re-description of those sequences, making no further existence claims once codes and the appropriate transition function are accepted.

The arguments in sections 2-6 in effect characterized visual representations as an algebraic system: a set of elements over which certain operations are defined. Those operations were shown to require all the arithmetic properties of addition and multiplication, showing that the codes form a well ordered integral domain. The tokens to this point constitute an uninterpreted calculus, given a structure solely by the operations defined for them. Can we give that calculus--the codes themselves--an interpretation?

The intended interpretation for visual codes is very clear. They are thought to be measurements of features of incoming light. The referents for visual codes are features of light entering the eyes; the codes are related to those features the way a measurement is related to the thing measured. The visual system measures intensities of different wavelengths at points all over the retina. From those primitive measures more complex measurements--such as those for edge length, width, orientation, and so on--are derived by computation.

It seems relatively uncontroversial to claim that the visual system measures light intensities, and more difficult to accept the claim that it employs codes which constitute a number system. However, the former implies the latter. If the visual system measures features of light, then it must employ numeric codes. Measurement is assignment of numbers to objects in accord with a rule. To set up a system of measurement is to define a functor mapping empirical properties or relations into numbers. To do this there must be sufficient similarity of structure between the empirically found properties and relations and the properties and relations of numbers. Otherwise the number system misrepresents relations among the physical properties.

The visual system could not map features into numbers unless it employed numeric codes. The idea is that retinal transducers can be characterized by a measurement functor, and their output--receptor potentials, and later spike patterns--are measurements. If the codes yielded by this process are not numeric--are not ordered by a transitive connected antisymmetric relation, do not contain a unique zero, and so on--then one cannot make sense of the idea that the process described is measurement. Once one does have an internal number system, however, there is no formal reason for denying the possibility of measurement on the sub-personal level. Suppose there is a physical feature which displays the appropriate relational structure. For example, light intensities can be ordered by a relation which is transitive, antisymmetric, and connected. Another way of putting this is that we can measure light intensities: we can define a function mapping that feature to numbers. There is no obvious reason for denying that the nervous system could itself instantiate that function. It maps light intensities into its internal numeric code. As long as its internal tokens are an algebraic system isomorphic to numbers, and the function always maps the same feature value to the same code, the formal requirements for measurement are met.
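The formal requirements just listed can be sketched directly (the coding scheme is invented for illustration): a functor from intensities into discrete internal tokens which always maps the same feature value to the same code, and preserves the ordering of intensities in the ordering of codes:

```python
# Sketch of a measurement functor from a physical feature (light
# intensity in [0, 1], here just floats) into discrete internal codes
# (4-bit numerals).  The scheme is invented for illustration.

def encode_intensity(i):
    # Maps an intensity onto a discrete internal token.
    level = max(0, min(15, int(i * 15)))
    return format(level, '04b')

intensities = [0.1, 0.4, 0.4, 0.9]
codes = [encode_intensity(i) for i in intensities]
print(codes)  # ['0001', '0110', '0110', '1101']

# Same feature value always gets the same code, and order is preserved:
assert encode_intensity(0.4) == encode_intensity(0.4)
assert all(int(a, 2) <= int(b, 2) for a, b in zip(codes, codes[1:]))
```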

This argument in effect provides semantic grounds for accepting an internal number system, where previous arguments have been entirely syntactic. All the earlier arguments have been in terms of the operations applied to visual codes, and have treated the codes as an uninterpreted calculus. The argument from measurement derives the arithmetic properties of codes from their relationship to light features; that is, from their reference. A semantic interpretation of codes as measures of light is only possible if the codes are numeric.

The proposed interpretation gives tokens of visual codes precise meaning. What a given code means is that some feature of ambient light exhibits such-and-such a magnitude. For example, an element within the gray level image represents the intensity of light within a retinal receptive field: it means that light is of such-and-such intensity in that field. Note that it does not represent, name, or mean a number: the domain of its interpretation is not sets of sets, but rather features of light. The codes constitute an algebraic system, whose operations are sufficient to establish its isomorphism with numbers. The interpretation of that calculus specifies a domain in ambient light and measurement functors mapping features of light into codes.

With the receipt of an interpretation, visual codes are at last admitted into the ranks of representation. There is no good reason to deny that those internal states represent. Tokens of codes are characterized syntactically, as primitives arranged in certain formal structures. Computable operations are defined for the codes, generating later representations from earlier ones. Neural implementations for the codes are suggested, in such a way that successive representations are successive in the nervous system, and the operations of particular parts of the nervous system can be characterized algorithmically. These implementation ideas also provide a plausible encoding of the initial representation (gray level image) by retinal receptors. The content of the codes is ultimately decoded, as shown by perception of depth in a random dot stereogram. Finally, a domain of interpretation can be provided for the codes, and each code can be linked to a referent by a semantic rule (in this case, a measurement functor).

The constructs in Marr's theory are representations, then, but they are neither propositional nor pictorial. First, an array of numerals does not have the syntactic form of a proposition. Each pixel value may be composed of syntactic atoms concatenated in various ways (as in digits in a number system), but the array itself has properties unlike a string of atoms. For example, distance between entries carries information (and is necessary for disparity calculations). Reference is achieved differently: there is no subject-predicate form, and there are no atoms within the token which have the job of referring. Instead a pixel value refers to a light patch because of its place in the array. Because of the topographic mapping of the nervous system, that 'place' is linked to a specific retinal location, and thereby to a particular patch of light.

A numeral itself is a syntactic composite and shows some of the features of descriptive representation. There are legal and illegal tokens, discriminable syntactically. Reference relies on semantic rules relating tokens to a domain. However, number systems differ from propositions by their potential for both semantic and syntactic density (Goodman, 1976, pp. 136, 153). That is, between any two rational numerals there is a third (syntactic density), and between any two values measured and encoded by those numerals, there is a third value (semantic density).

Density is a characteristic of pictorial representation, but Marr's constructs are missing many of the key depictive features as well. While pixel values are dense, each pixel value is a token of a numeral, and not a color patch. Pixels do not attribute properties via resemblance. Furthermore, while each pixel exhibits density, the array as a whole does not. Between adjacent pixels there is no third. Such arrays are similar to pictures in that they have two dimensions which have semantic import. Distances between pixels are critical for disparity. However, unlike the continuous points of a picture, the array has a finite number of entries, and one does not find the continuity of reference across its surface that one finds in a picture. While its relation to light features is in a sense 'projective,' it is only in the sense of neural projections and not projective geometry. That is, inter-pixel distance and resolution differ drastically at different points on the retina. Projective geometry also fails for a reason noted above: the array does not show density across its extent, but has discrete subscripts.

Marr's constructs are like pictures in that their two dimensional character is semantically significant, and there is a mapping (though not by visual projective geometry) between array places and the visual field. But the constructs are proposition-like because they contain numeral tokens which require a semantic rule to gain reference. In short, the constructs have some but not all of the characteristics of both propositional and pictorial representation. They employ an ingenious blend of propositional semantics (for pixel tokens), pictorial semantics (for distance across pixels), propositional syntax (for numeral tokens within a pixel), and pictorial syntax (for a meaningful two dimensional token). They have both analog features (dense pixel tokens) and digital features (discrete array subscripts).

The moral is that the word-picture dichotomy is a false one. The constructs employed in computational theories of vision drive a wedge between--or better, constitute an intervening representation between--the representational party lines of Words versus Pictures. The primal sketch is not a proposition, but it is not a picture either: it is an array (ordered by retinal coordinates) of primitives measuring magnitudes of light features. There is no conclusive answer to the question: "Is the primal sketch a Picture or a Proposition?" and that dichotomy should be abandoned. Philosophical accounts of cognitive science must broaden their conceptions both of computation and of representation to include what is, after all, the paradigm case: computing with numerals.

Ashley, J. R. (1963) *Introduction to Analog Computation*. New
York: John Wiley and Sons, Inc.

Bullock, T. H. (1973) Seeing the world through a new sense: electroreception in fish. American Scientist, 61: 316-325.

Carnap, R. (1939) *Foundations of Logic and Mathematics*.
Chicago: University of Chicago Press.

Fodor, J. A. (1975) *The Language of Thought*. Cambridge, MA:
Harvard University Press.

Frisby, J. P. (1980) *Seeing*. Oxford: Oxford University
Press.

Goodman, N. (1976) *Languages of Art*. Second Edition.
Indianapolis: Hackett Publishing Company, Inc.

Grimson, W. E. L. (1981) *From Images to Surfaces*. Cambridge,
MA: MIT Press.

Hopcroft, J. E., and Ullman, J. D. (1979) *Introduction to
Automata Theory, Languages, and Computation*. Reading, MA:
Addison-Wesley Publishing Company.

Krantz, D. H.; Luce, R. D.; Suppes, P.; and Tversky, A. (1971) *Foundations
of Measurement*. Volume I. *Additive and Polynomial
Representations*. New York: Academic Press.

Marr, D. (1982) *Vision*. San Francisco: W. H. Freeman and
Company.

Marr, D. (1979) Representing and computing visual information. In:
P. H. Winston and R. H. Brown (eds.) *Artificial Intelligence: An
MIT Perspective*. Vol. 2. Cambridge, MA: MIT Press.

Marr, D. (1976) Early processing of visual information. *Philosophical
Transactions of the Royal Society of London* *275* (942):
483-534.

Marr, D. and Hildreth, E. C. (1980) Theory of edge detection. *Proceedings
of the Royal Society of London* *B* *207*: 187-217.

Marr, D. and Poggio, T. (1977) From understanding computation to
understanding neural circuitry. *Neuroscience Research Program
Bulletin 15* (3): 470-488.

Marr, D. and Poggio, T. (1979) A theory of human stereo vision. *Proceedings
of the Royal Society of London* *B* *204*: 301-328.

Perkel, D. H. and Bullock, T. H. (1968) Neural coding. *Neuroscience
Research Program Bulletin*, *6*: 221-348.

Pylyshyn, Z. W. (1980) Computation and cognition: issues in the
foundations of cognitive science. *Behavioral and Brain Sciences*
3: 111-169.

Quine, W. V. O. (1969) *Set Theory and its Logic*. Revised
Edition. Cambridge, MA: Harvard University Press.

Stoll, R. (1963) *Set Theory and Logic*. Dover Books.

Wilson, H. R. and Bergen, J. R. (1979) A four mechanism model for
threshold spatial vision. *Vision Research 19*: 19-32.

Supported by a grant from the National Institute of Mental Health while the author was a postdoctoral fellow at the L. L. Thurstone Psychometric Lab, Department of Psychology, University of North Carolina at Chapel Hill. The author gratefully acknowledges the criticisms and suggestions of Marcy Lansman, Bill Lycan, and Thomas Wallsten, as well as those of an audience of psychologists and philosophers at Carnegie Mellon University, where some of the ideas of this paper were presented in a colloquium.

Originally typed on a PDP-11 in the L. L. Thurstone Psychometric Lab in 1982. Carried on a computer tape to the IBM mainframe in Tulsa in 1983. Converted in 1984 to Word 1.1 running on DOS 2.0. Converted to various intermediary formats since then.
