IST 8A, Fall 2008

Lecture 10

Vision Systems II



Through the case of Dee Fletcher, who had selective brain damage, Milner and Goodale have developed a picture of the visual system as specialized into two subsystems with very different purposes: vision for action and vision for perception. In turn, each of these subsystems is further specialized into modules designed to extract different features of the visual scene (in the vision for perception stream) or to implement different motor actions based on visual information (in the vision for action stream).


This lecture will summarize the modularity of the visual system, and talk further about the computational methods used by each subsystem to accomplish its purposes.


Two visual systems and the tasks they perform

The authors discuss two main visual systems and the tasks they perform, in primates (which have been studied extensively) and in humans (for whom the evidence comes from patients with lesions, such as Dee Fletcher). From our previous discussion we know that there are more than two systems, but the rest of the discussion in Chapter 4 is focused mainly on vision for action and vision for perception.


Vision for perception

According to the authors, one defining characteristic of vision for perception is that it creates internal representations of the visual world, which can be accessed “off-line” for various purposes. These are our conscious visual perceptions. The pathways of vision for perception are not linked to motor areas of the brain. Instead they include areas with specialized abilities to recognize edges, textures, faces, familiar objects and places. These areas all lie within the ventral temporal lobe. From there the pathways link to other areas involved in memory, emotion and decision making.


Vision for action

These pathways are concerned with acting in the present: taking fast action without reflection. Memory and emotion do not enter, but fast motor response and control do. The dorsal stream proceeds from V1 into the parietal lobe. An evolutionarily older stream bypasses the LGNd and V1, going instead through the superior colliculus. In primates this stream controls head and eye movements, but unlike in the frog it is now refined by the presence of the dorsal stream. This means that primate head and eye movements are much more sophisticated than those of a frog. But control of these motions is not accessible to consciousness, any more than it would be in the frog (to the extent that a frog is conscious!).


Evidence for separate systems

Prior to Dee Fletcher, lesion studies with monkeys dating back to the 1800s had suggested the existence of two separate vision systems. Monkeys with damage to what we would now call the dorsal stream would fumble when trying to pick up objects, even though they could see the objects perfectly well. And monkeys whose ventral stream was damaged could not be trained to recognize new objects, but could dexterously pluck flies out of the air!


More recently, techniques were developed to monitor the firing of individual neurons. When applied to ventral stream areas, these recordings showed that some neurons fire preferentially for edges shown to the monkey at specific angles. And we already know about the fusiform gyrus, which has neurons that fire selectively for faces. A common trait of all these neurons is that they are quite “particular” about which visual features cause them to fire, but also quite insensitive to other features such as viewpoint or lighting.


This discovery of preferential firing was the beginning of our current conception that the ventral stream’s job is to represent features of the visual world as opposed to reacting to them. And it gives a clue to the nature of our conscious perceptions: when viewing from a variety of vantage points, we still recognize the same scene.


Modules – the building blocks of the visual streams

The idea is that the broad specialization of visual streams into two main tasks – vision for action and vision for perception – is implemented by sub modules belonging to each stream. The existence of modules is indicated by imaging and electrode recording studies, and also by lesions: localized damage can result in specific gaps in cognition or behavior.


In the vision for perception stream there are sub modules for recognition of objects, places and faces. In the vision for action stream there are areas for reaching, grasping, and eye-flick motions (saccades), among others. A deficit in one of the ventral stream modules leads to some kind of “agnosia” (= “not knowing”), for example prosopagnosia (face blindness) or topographical agnosia (inability to recognize familiar landmarks).


The method for isolating these areas and abilities is the same as that for the broader visual streams themselves. It is termed double dissociation. Double dissociation means that across two patients, one patient has lost ability A while retaining ability B, and the other has retained ability A while losing ability B. Since each ability can fail without the other, the two must be dissociated in the brain, that is, carried out by separate areas. Double dissociation is what leads to the concept of brain modularity.
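
The logic of double dissociation can be written as a simple check over patient records. The following is only a minimal sketch: the function name, the record format, and the two example patients are hypothetical illustrations (using the "faces" and "places" abilities discussed in these notes), not data from the book.

```python
from typing import Dict, List

# Hypothetical record format: each patient maps an ability name to
# True (retained) or False (lost). The values below are illustrative only.
Patient = Dict[str, bool]


def is_double_dissociation(patients: List[Patient], a: str, b: str) -> bool:
    """Return True if one patient lost `a` but kept `b`,
    and another patient kept `a` but lost `b`."""
    lost_a_kept_b = any(not p[a] and p[b] for p in patients)
    kept_a_lost_b = any(p[a] and not p[b] for p in patients)
    return lost_a_kept_b and kept_a_lost_b


patients = [
    {"faces": False, "places": True},   # recognizes places but not faces
    {"faces": True, "places": False},   # recognizes faces but not places
]

print(is_double_dissociation(patients, "faces", "places"))      # both directions present
print(is_double_dissociation(patients[:1], "faces", "places"))  # only one direction
```

A single patient gives only a single dissociation, which by itself could mean that one task is simply harder than the other. It is the second, opposite pattern that argues for separate modules.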


The dorsal stream, by contrast, has areas that fire only when visual information is accompanied by action. For example, there are separate areas that fire when a monkey reaches for a target, or just flicks its eyes toward it.


Terminology for deficits due to damage in each stream

Here is a summary of terminology that describes the different conditions of dorsal or ventral stream damage:

  1. Visual form agnosia

“Not knowing” aspects of the visual field; blindness to at least some features: damage to the ventral stream (Dee Fletcher)

  2. Optic ataxia

Incoordination of muscle movements involved in grasping objects; inability to reach for or appropriately align the hand to pick up something: damage to the dorsal stream


Overall organization

This section reviews the organization of the visual streams, starting with the areas in the occipital cortex (the “V’s”) along with their connections to the areas for recognition and action.


The next figure shows the locations of the v-areas in the occipital cortex.


Figure 1. Diagram of the visual areas in the occipital cortex, seen from medial (left) and lateral (right) viewpoints. Front of the brain is to the left in each picture.


Representations of visual data in the ventral stream

This section will talk about some aspects of visual representation, starting in primary visual cortex (V1) and continuing into the ventral stream. Here there are neurons that selectively fire for certain kinds of input. This means that they encode or represent certain kinds of data but not others.


The next figure is an example of what takes place in V1 itself. This area has neurons specialized for encoding lines and edges at various orientations. Here is a striking example!



Figure 2. Encoding a grid pattern in V1
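
The idea that a V1 neuron "prefers" edges at one orientation can be illustrated with a toy tuning curve. This is a sketch under stated assumptions: the Gaussian shape, the 20-degree bandwidth, and the function name are chosen for illustration and are not taken from the book or from any measured neuron.

```python
import math


def orientation_response(stimulus_deg: float, preferred_deg: float,
                         bandwidth_deg: float = 20.0) -> float:
    """Toy firing rate between 0 and 1 for an edge at `stimulus_deg`.

    Edge orientation repeats every 180 degrees (an edge at 0 degrees looks
    the same as one at 180 degrees), so the difference is wrapped.
    The Gaussian falloff and the bandwidth are assumptions.
    """
    diff = abs(stimulus_deg - preferred_deg) % 180.0
    diff = min(diff, 180.0 - diff)  # wrapped difference, from 0 to 90 degrees
    return math.exp(-0.5 * (diff / bandwidth_deg) ** 2)


# A model neuron that prefers edges at 45 degrees
for angle in (0, 45, 90, 135):
    print(angle, round(orientation_response(angle, preferred_deg=45.0), 3))
```

The model neuron responds strongly near its preferred angle and hardly at all to the perpendicular orientation. A population of such units with different preferred angles would together encode the orientation of any edge.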


“Downstream” from V1, in the inferior temporal lobe, are areas we have talked a lot about! These areas are more “fussy” (to use the words of M & G, Chapter 4) about which stimuli they fire for. Thus they encode higher order features of a scene, such as faces.


As the authors state, representation is the characteristic activity of the ventral stream. These representations have complex interconnections with other brain areas involving memory and emotion, for example. Thus, our vision for perception is heavily connected to other areas involved in consciousness.


Activations in the dorsal stream

The authors also state that visual data reaching the dorsal stream typically causes firing only when there is interaction with an object. This has been determined in single cell recordings of monkeys, showing activation if the monkey reaches for an object, makes eye saccades (flicking the eyes toward the object), or grasps the object. Thus, there is not a visual representation in the same way as in the ventral stream. Instead, if you want, there are codings for action in response to visual stimuli. One key difference is that the dorsal stream has direct connections to motor areas of the brain, whereas the ventral stream does not.

The next diagram illustrates some of these ideas.


Figure 3. Organization of the two streams into subcomponents. Source for Figures 2 and 3:


Next is a diagram of the main visual streams and some important modules in each stream. The ventral stream (vision for perception) goes through the inferior temporal lobe (IT). The dorsal stream (vision for action) is mainly found in the intraparietal sulcus (IPS). The “V areas”, including V1 (the primary visual area), lie in the occipital lobe.


The diagram shows the organization of the two streams and their sub modules as described in the book. This leads to the following points:


  1. Visual stimuli for both streams originate in the “V areas” of the occipital cortex. As the diagram indicates, the two streams share the same input. The diagram does NOT show that the dorsal (action) stream also gets input via a pathway that bypasses the V areas. See Figure 4 of Lecture 9.
  2. Perception lies in the ventral stream, which runs through the inferotemporal cortex (IT). Action lies in the dorsal stream, which is mainly found in the intraparietal sulcus (IPS).
  3. The modules in the boxes each represent specific discrete areas of the brain. See Plate 4 in Sight Unseen. Their names and functions are given.
  4. Distinct brain areas correspond to distinct but related cognitive functions.
  5. These are delineated using the technique of double dissociation. Thus, for example, you can have a patient with “faces but not places” vs. another patient with “places not faces”. Interestingly, this also happens with actions. Thus, there are people with the ability to move their eyes (saccades) toward an object, but not grasp it. And vice versa.
  6. More precise delineation of these areas has been accomplished by fMRI which shows brain activations in response to specific tasks.
  7. The modules in the perception stream are specialized for extracting specific representational information from the visual scene. Modules in the action stream are specialized for implementing specific vision-controlled motor actions.
  8. The time frame of each system is very different. The representations of the perception system are stored for use minutes to years later. The computations of the action system are useless after the action is done, because they are so specific. Thus they are forgotten – or actually never stored.
  9. Time considerations are highlighted by some interesting experimental results. Dee Fletcher can do well at grasping objects in real time, even though she cannot consciously perceive their shape. But if she is forced to delay a few seconds and then reach for where an object used to be, her performance is terrible! On the other hand, a person with optic ataxia (dorsal stream damage, inability to reach in real time) actually improves if forced to delay. Then the visual representations can actually be of some help in guiding the delayed reaching action!


Computational tasks of each system

The perception and action streams were revealed through lesions that resulted in behavioral deficits. And the deficits delineated the nature of each stream. It became apparent that the perception stream was for making representations that could be processed “off line” as in thinking, remembering and planning, while the action stream was for guiding motion “on line”. These concepts are summarized by the following points:


  1. Perception by means of representation. This representation is contextual which means it shows how the components of the scene are related to one another. This is the kind of representational ability we use when watching TV or a movie. From the relationships of the characters and objects we derive a sense of meaning such as a story. But this ability is useless for actually interacting with objects on the screen. We can’t pick anything up that is shown in the movie!
  2. Action is done in real time. This means you actually have to interact with an object, such as reaching for it and adjusting your grasp to pick it up. The information to do so is of a very different sort than the representational information of perception.
  3. Thus the computational demands of the two systems are very different. This is one more reason why the two systems are separate. To combine them in one module would be a computational nightmare.
  4. Perception uses a scene-based metric which means it represents all objects in the context of the scene. This representation is insensitive to individual viewpoint. It uses abstraction (I was here, she was there on my right) in a way that can be remembered.
  5. Action uses an ego-based metric which means it depends on the real-time position of the person. For example, when you want to pick up a cup, precise calculations have to be made for your current position and distance from the cup, and the shape and size of the cup itself. This is constantly updated as you reach out to grasp the cup. But after the action is done, there is no need to retain these computations. This is one reason why the action system is not part of our conscious perception.
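
The difference between the two metrics in points 4 and 5 can be made concrete with a small geometric sketch. Everything here is an illustrative assumption: the 2D tabletop coordinates, the function names, and the cup and plate positions are invented for the example and do not describe how the brain actually codes these quantities.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y) position on a tabletop, in meters


def scene_distance(obj_a: Point, obj_b: Point) -> float:
    """Scene-based: distance between two objects. The viewer does not appear."""
    return math.dist(obj_a, obj_b)


def egocentric(obj: Point, viewer: Point, heading_rad: float) -> Tuple[float, float]:
    """Ego-based: (distance, bearing) of an object relative to the viewer.

    Bearing is measured counterclockwise from the viewer's heading,
    in radians, wrapped to the range -pi..pi.
    """
    dx, dy = obj[0] - viewer[0], obj[1] - viewer[1]
    distance = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - heading_rad
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap the angle
    return distance, bearing


cup, plate = (1.0, 0.0), (1.0, 1.0)  # hypothetical object positions

# The scene-based description is the same wherever the viewer stands.
print(scene_distance(cup, plate))

# The ego-based description changes as soon as the viewer moves.
print(egocentric(cup, viewer=(0.0, 0.0), heading_rad=0.0))
print(egocentric(cup, viewer=(0.0, -1.0), heading_rad=0.0))
```

The scene-based value can be stored and reused later because it does not depend on the viewer. The ego-based values must be recomputed every time the viewer moves, which fits the point that they are not worth retaining after the action is done.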


Experiments to reveal the different types of computation

The text describes several interesting types of experiments to reveal the different computational abilities of each system. The perception system needs to form judgments of relative closeness or distance between objects. It uses cues such as object occlusion (one object partially blocking another) or relative size to make these inferences. The action system needs to compute absolute positions of objects in order to interact with them.


Illusion experiments

This is why the perception system is more easily fooled by illusions. The common factor in the illusions cited by the text is that context makes us perceptually misjudge the size of one component. Our perception of an object’s size can be measured by asking us to indicate how big it is by opening our fingers. Typically subjects will open to the wrong width when influenced by context. But when asked to grasp the component (if the illusion is implemented in 3D) our hands open to the appropriate width. Thus our action system is not fooled. This shows that it depends on a different kind of computation that ignores contextual cues.



Pantomime is the process of consciously (i.e. using the perceptual system) mimicking motions that are normally carried out by the action system. It is the difference between pretending to pick up a non-existent cup and actually picking up a real cup. Most normal people are worse at pantomime than at executing the real actions. This is because the pantomimed actions depend on the very different computation in the perceptual system, which favors generalities and context over precision.


For example, in one experiment the subject is shown an object, which is then removed. After a short delay, the subject is asked to pantomime picking up the object. To do so, the person has to consciously remember and visualize the object, then adjust the hands and arm reach accordingly. The result is crude for normal people, but Dee Fletcher cannot do it at all, even though she could pick up the object in real time. On the other hand, people with dorsal stream damage (optic ataxia) cannot pick up objects in real time. But their performance actually improves when asked to pantomime!


Distance computations in vision for action

We have already mentioned how the perception system uses cues to make contextual judgments about a scene. That is why it is so easily fooled. But how does the action system avoid these traps? How does it compute exact metrics for picking up an object or threading a needle?

The answer is binocular vision. In Dee Fletcher’s case, she computes distance to an object by monitoring (unconsciously, of course) the degree to which both her eyes must converge to fixate on the object. This is revealed by the fact that she does poorly on motor tasks when one eye is covered.
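
The geometry behind distance from eye convergence can be sketched in a few lines. This is a simplified illustration, not a claim about the actual neural computation: it assumes the object is straight ahead so that both eyes turn inward by the same amount, and the default eye separation of 0.063 m is an assumed typical adult value rather than a figure from the book.

```python
import math


def distance_from_vergence(vergence_deg: float, ipd_m: float = 0.063) -> float:
    """Distance in meters to a fixated point straight ahead.

    Assumes symmetric fixation: each eye rotates inward by half the
    vergence angle, so tan(vergence / 2) = (ipd / 2) / distance.
    `ipd_m` (distance between the eyes) is an assumed typical value.
    """
    if not 0.0 < vergence_deg < 180.0:
        raise ValueError("vergence angle must be between 0 and 180 degrees")
    half_angle = math.radians(vergence_deg) / 2.0
    return (ipd_m / 2.0) / math.tan(half_angle)


# Larger convergence angles correspond to nearer objects.
for angle in (2.0, 5.0, 10.0):
    print(angle, round(distance_from_vergence(angle), 2))
```

With one eye covered there is no convergence angle to read off, so this cue is unavailable. That is consistent with the observation above that Dee Fletcher does poorly on motor tasks when one eye is covered.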


-- Evan Fletcher