Editorial: Perceptual organization in computer and biological vision

James H. Elder, Mary A. Peterson, Dirk B. Walther

Published: 01 Jan 2024, Last Modified: 18 Dec 2024. Frontiers Comput. Sci. 2024. CC BY-SA 4.0

Abstract: executing a smooth versus abrupt change in orientation. They find that a triplet of dots forming an obtuse angle (more than 90 degrees) is perceived as a smooth contour, whereas a triplet forming an acute angle (less than 90 degrees) is perceived as an abrupt vertex. Dot displays that describe curvilinear contours as opposed to sharp-angled vertices allowed for clearer perception, better mental rotation, and more accurate detection of shapes. These results may reflect the underlying statistics of smooth contour curvatures and abrupt orientation changes we encounter in the visual world.

Ultimately, neural mechanisms must organize local spatial features coded in early stages of the visual system into the coherent object representations we perceive. The grouping cues that support this computation include geometric regularities of the object's bounding contour (e.g., good continuation) as well as photometric regularities within the object (e.g., colour similarity; Elder & Zucker, 1996, 1998). In this contribution, Hii & Pizlo propose a foveated shortest-path model of contour grouping to explore the potential fusion of geometric contour cues and colour cues in recovering complete object boundaries. Psychophysical results demonstrate that the human visual system can synergistically combine geometric and colour grouping cues, in qualitative agreement with their computational model.

The contribution from Hii & Pizlo concerned how local elements on the retina are organized into a representation of a coherent figure or object. But perceptual organization extends beyond a single figure to determine how we perceive collections of figures or objects in the scene. One window into this process is provided by the study of crowding. Crowding is the phenomenon that fine spatial judgements can be made more difficult if extraneous 'distractor' elements are brought near to the stimulus being judged.
Uncrowding refers to the remarkable fact that adding a regular pattern of multiple distractors can release this effect. This phenomenon has generally been attributed to the perceptual organization of these extraneous elements into a perceptual group apart from the stimulus being judged. However, in their psychophysical study, Choung et al. find that, while the degree of uncrowding is strongly correlated with perceived grouping, simple models of perceptual grouping fail to account for this relationship. This suggests that the formation of perceptual groups may depend upon subtle interplays and higher-level perceptual interpretations of the visual stimulus that are not easily captured by a simple combination of Gestalt laws.

The studies discussed so far provide intriguing insight into perceptual organization in the 2D image plane. But how does this relate to the structure of our 3D visual world? Bolelli et al. note that the back-projected boundaries of solid objects are generally not planar curves (Koenderink, 1984), and their 3D structure can be important to perceptual organization and object understanding. Fortunately, this 3D structure can potentially be recovered via the geometry of the binocular projection. Bolelli et al. introduce a mathematical framework relating the projected geometry of these 3D curves to binocular neural selectivity. Based on tools from sub-Riemannian geometry, their model makes predictions about how interactions between neurons in early visual cortex should depend upon the ocularity and joint position-orientation tuning of the neurons. This model provides a framework for understanding the stereo correspondence problem as well as torsional eye movements.

The challenge of perceptual organization extends not only over the three dimensions of space but also the dimension of time.
The review article by Lappin & Bell details how the brain uses spatiotemporal regularities in moving images to perceptually organize the visual stream into continuous surface representations that support the discrimination of fine spatiotemporal judgements with hyperacuity precision.

The foveated shortest-path object grouping model of Hii & Pizlo entails an incremental construction of progressively more global, complex and complete representations. While Hii & Pizlo do not suggest a specific mapping of their model to brain regions, it is common to assume that such computations proceed hierarchically from early to later visual areas. However, a body of work from von der Heydt and colleagues (Zhou et al., 2000; Craft et al., 2007; von der Heydt, 2015; Williford & von der Heydt, 2016), and others, provides an alternative account. These findings include neural sensitivity in earlier areas of visual cortex to illusory contours and figure/ground assignment that could only emerge from more global computations, challenging the conventional view. In particular, the identification of border ownership cells in cortical area V2 that respond selectively to a contour depending upon the figure/ground sign is strong evidence against a feedforward, hierarchical view of object perception. What is the alternative? Von der Heydt reviews computational and neurophysiological research supporting the existence of grouping cells (G cells) that pre-attentively link neurons in early visual areas that are selective for contours to form representations of global 'proto-objects' via recurrent processing. Von der Heydt conjectures that these G cells might be located outside of the object pathway in ventral stream, since recordings in areas V1, V2, and V4 have failed to confirm their existence.

Peterson & Campbell also present evidence against a feedforward account of visual perception.
They show that recurrent processing plays an essential role in the perception of classic figure-ground displays that were long taken as evidence that convexity is an important prior in building objects in a bottom-up fashion. Previously, Peterson and Salvagio (2008; see also Goldreich & Peterson, 2012) found that convexity is a weak figural prior unless it is supplemented by a background prior. The background prior requires homogeneous fill in concave regions alternating with convex regions. Peterson & Campbell show that the convexity prior and the background prior conflict in traditional displays where both convex and concave regions are homogeneously colored and that recurrent processing resolves this conflict before conscious perception. Furthermore, they identify both cortico-cortical and cortical-thalamic recurrent processes in the perceptual organization of the classic displays. Their experiments show that dynamical recurrent interactions are involved in some of the foundational experiments taken as evidence for a feedforward model of figure-ground perception.

It has long been debated whether the process of amodal completion of partially occluded objects demands attention and awareness, or whether it can occur autonomously. Here, Kimchi et al. report four experiments investigating this question, using a variant of a color-opponent flicker technique in which a priming stimulus can be presented for a duration necessary for perceptual completion while remaining outside perceptual awareness. Kimchi et al. used this technique to create priming stimuli that cued either a local, global or ambiguous interpretation of a subsequent target stimulus. They found that when the prime indicated a local completion, local targets were classified faster than global targets, suggesting that local completion can take place without visual awareness.
However, when the prime cued a global or ambiguous interpretation, target responses were unaffected by the prime, which they take as evidence that awareness is necessary to resolve ambiguity and to generate a global completion.

Vision is only one of the human senses, and fusion with haptic sensing could be particularly important to inform the perceptual organization of occluded objects that are only partially visible to the eye. Prior work has shown that partially occluded faces are more easily recognized when the occluders are stereoscopically rendered to appear in front of, rather than behind, the faces. Here, Takeichi et al. use virtual reality to investigate how both visual and haptic information about the relative depth of the occluder affects recognition of katakana characters. While the haptic cue was found to increase the confidence of observer judgements of the relative depth of the occluder, there was no effect on character recognition. Also, counter to prior work with faces, character recognition was better when the "occluder" was rendered to be behind, rather than in front of, the character, suggesting that 3D processing may be different for specialized 2D stimuli like textual characters than for faces.

The research reviewed above largely follows in the tradition of Gestalt psychology in using highly simplified stimuli to isolate specific perceptual factors and test hypotheses. However, the maturation of computer vision technologies provides an opportunity to explore whether principles of perceptual organization generalize to real-world scenes in all of their complexity. Walther et al. provide a useful resource for this endeavour with their Mid-Level Vision (MLV) Toolbox. The toolbox offers algorithms for extracting contours from photographs and for computing a variety of contour properties: orientations, curvature, length, and contour junctions.
Relying on the medial axis transform as a dual representation of scene contours, the toolbox provides code to compute measures of local parallelism, local mirror symmetry and contour separation. The toolbox also contains code for visualizing these properties and for manipulating contour drawings based on them.

The success of deep learning models in solving computer vision problems has led to their adoption as potential models for predicting neural and behavioural response to visual stimuli. While these models do capture many aspects of neural and behavioural response, there are intriguing divergences in how networks handle out-of-distribution perturbations such as image blur. Here, Yoshihara et al. find that training convolutional networks with a mix of blurry and sharp images makes them more human-like in their robustness to blur and weighting of shape vs texture in making classification decisions (Geirhos et al., 2018).

Training with blurred stimuli likely knocks out fine-scale texture cues that networks tend to rely on by default, upweighting the use of shape cues. But what is the nature of the shape cues that these networks can use? While humans make profound use of configural shape information, recent research suggests that deep networks struggle to organize these global shape cues, relying more on local shape features (Baker et al., 2018; Baker & Elder, 2022). In their contribution, Jarvers & Neumann perform a new analysis of deep neural network shape sensitivity that suggests that the addition of recurrent or residual connections can enhance sensitivity to non-local shape, although not to the extent seen in humans.
These results suggest future directions for neural network design that may lead to models that are better able to capture the human ability to organize local features into representations of global object shape.

Deep learning models have made substantial gains in performance through mechanisms of 'self-attention' and 'cross-attention', which allow for multiplicative interactions between data inputs and are the basis for more recent state-of-the-art transformer architectures. Here, Mehrani & Tsotsos argue that the effect of self-attention is in fact more appropriately described as perceptual organization based on feature similarity. In a series of computational experiments, they demonstrate that vision transformers learn to group stimuli based on features such as hue, lightness, saturation, shape, size, or orientation and suggest that this can be thought of as a form of horizontal relaxation labeling. This novel view provides insight into how transformer architectures may solve difficult perceptual organization problems that challenge convolutional architectures.
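
As a rough illustration of why self-attention can behave like grouping by feature similarity, the toy sketch below computes softmax attention weights over six tokens that differ only in hue. It is not the analysis of Mehrani & Tsotsos and involves no trained vision transformer: the identity query/key projections, the hue-to-unit-vector encoding, the `scale` value, and all function names are assumptions made purely for illustration.

```python
import math

def hue_to_vec(hue_deg):
    """Encode a hue angle (degrees) as a unit vector on the colour circle (assumed encoding)."""
    r = math.radians(hue_deg)
    return (math.cos(r), math.sin(r))

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(features, scale=5.0):
    """Dot-product self-attention weights with identity query/key projections.

    `scale` is a hypothetical sharpness parameter; a trained transformer would
    instead learn its projections and use 1/sqrt(d) scaling.
    """
    rows = []
    for q in features:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in features]
        rows.append(softmax(scores))
    return rows

# Two hypothetical groups of tokens: reddish hues and cyan-ish hues.
hues = [0, 10, 5, 180, 190, 175]
feats = [hue_to_vec(h) for h in hues]
W = attention_weights(feats)

# Fraction of token 0's attention that stays within its own hue group.
within_group = sum(W[0][j] for j in range(3))
print(f"within-group attention for token 0: {within_group:.4f}")
```

Under these assumptions, each token attends almost entirely to tokens of similar hue, so the attention matrix is close to block-diagonal; this is the sense in which similarity-weighted attention resembles a grouping operation.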