Perceptually-consistent boundaries and shape from stereo occlusions

Jialiang Wang

10 Sept 2021 | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: Stereo is the estimation of depth from two views of the same scene. A stereo image pair carries two types of depth cues: matching and occlusion. Matching refers to regions of the scene that are visible in both views, and occlusion refers to regions that are visible in one view but hidden in the other by a nearby occluder. Vision science has shown that both types of information are used for depth estimation early in the visual cortex, and that human observers can perceive depth even when matching cues are absent or very weak, a capability that most computer vision stereo systems lack. This is one reason why state-of-the-art computer vision stereo algorithms still perform poorly near occlusion boundaries and in the associated occluded regions. Occlusion boundaries are nevertheless important because they often coincide with object boundaries, and localizing them and predicting precise depth near them is vital for many visual tasks such as grasping, manipulation and navigation. This dissertation describes a series of works that make progress towards using stereo occlusion information and improving depth accuracy near occlusion boundaries. We begin by introducing a taxonomy of local occlusion boundary signatures among patterns of stereo matching scores, categorized by the amount of texture on the nearby foreground and background surfaces. This taxonomy motivates the investigation of detectors for stereo occlusion boundaries in different types of scenes, and we design such a detector using a simple feedforward network with relatively small receptive fields. We show that this local detector produces better boundaries than many other stereo methods, even without incorporating explicit stereo matching, top-down contextual cues, or single-image boundary cues based on texture and intensity. Next, we describe two algorithms that successfully integrate the matching and occlusion cues in stereo systems.
In both algorithms, we represent the disparity map as a piecewise-smooth function with explicit breakpoints between its smooth pieces. The first algorithm re-examines scanline stereo as energy minimization: we show that, with this piecewise-smooth representation, matching and occlusion signals can be integrated into a simple objective function that can be optimized using dynamic programming. Experimentally, the global optimum of this objective matches human perception on a broad collection of well-known perceptual stimuli, and it also provides reasonable piecewise-smooth interpretations of depth in natural images. The second algorithm is a 2D bottom-up cooperative approach to stereo disparity estimation in a level-set framework. Focusing on bi-layer, figure-ground scenes, the algorithm properly accounts for occlusion geometry in its objective function. With approximate initialization, it converges to estimates of foreground occlusion boundaries that are more accurate than those of many existing techniques.
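
For readers unfamiliar with scanline stereo, the general idea of combining matching and occlusion costs in one objective that dynamic programming can minimize exactly may be sketched as follows. This is a minimal, textbook-style illustration and not the dissertation's objective, which uses a piecewise-smooth disparity representation with explicit breakpoints. The function name `scanline_stereo`, the squared-difference matching cost, and the constant per-pixel penalty `occ_cost` are all assumptions made for this example.

```python
import numpy as np

def scanline_stereo(left, right, occ_cost=1.0):
    """Align one pair of rectified scanlines by dynamic programming.

    Hypothetical illustration: each left pixel is either matched to a
    right pixel (squared-difference cost) or declared occluded (constant
    cost occ_cost). Returns per-pixel disparity for the left scanline
    (-1 marks an occluded pixel) and the minimum total cost.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n, m = len(left), len(right)

    # cost[i, j]: minimum cost of explaining left[:i] and right[:j].
    cost = np.zeros((n + 1, m + 1))
    cost[:, 0] = occ_cost * np.arange(n + 1)
    cost[0, :] = occ_cost * np.arange(m + 1)

    # move[i, j]: 0 = match, 1 = left pixel unmatched, 2 = right pixel unmatched.
    move = np.zeros((n + 1, m + 1), dtype=int)
    move[1:, 0] = 1
    move[0, 1:] = 2

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = (
                cost[i - 1, j - 1] + (left[i - 1] - right[j - 1]) ** 2,
                cost[i - 1, j] + occ_cost,
                cost[i, j - 1] + occ_cost,
            )
            move[i, j] = int(np.argmin(options))
            cost[i, j] = options[move[i, j]]

    # Backtrack from the end of both scanlines to recover the alignment.
    disparity = np.full(n, -1)
    i, j = n, m
    while i > 0 or j > 0:
        if move[i, j] == 0:
            disparity[i - 1] = (i - 1) - (j - 1)  # x_left - x_right
            i, j = i - 1, j - 1
        elif move[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return disparity, cost[n, m]

# Toy scanlines: the right view is the left view shifted by one pixel.
disparity, total = scanline_stereo([10, 20, 30, 40, 50], [20, 30, 40, 50, 60])
print(disparity.tolist(), total)  # [-1, 1, 1, 1, 1] 2.0
```

In the toy input, the first left pixel has no counterpart in the right scanline, so the optimum leaves it unmatched and assigns disparity 1 to the rest. The sketch does not enforce non-negative disparities or any smoothness between neighboring pixels; both are simplifications of this illustration.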
