Keywords: stereo disparity estimation, vision transformer, positional encoding
Abstract: Deep stereo disparity estimation has long been dominated by a \textbf{matching-centric paradigm}, built on constructing cost volumes and iteratively refining local correspondences.
Despite its success, this paradigm exhibits an intrinsic vulnerability: visual ambiguities from occlusion or non-Lambertian surfaces inevitably induce erroneous matches that refinement cannot correct.
This paper introduces \textbf{DispViT}, a new architecture that establishes a \textbf{regression-centric paradigm}.
Instead of explicit matching, DispViT directly regresses disparity from tokenized binocular representations using a single-stream Vision Transformer.
This is enabled by a set of lightweight yet critical designs, including a probability-based disparity parameterization for stable training (sketched after the abstract) and an asymmetrically initialized stereo tokenizer for effective view distinction.
To better align the two views during stereo tokenization, we introduce a novel shift-embedding mechanism that encodes different disparity shifts into separate channel groups, preserving geometric cues even under large view displacements (see the second sketch after the abstract).
A lightweight refinement module then sharpens the regressed disparity map for fine-grained accuracy.
By prioritizing holistic regression over explicit matching, DispViT streamlines the stereo pipeline while improving robustness and efficiency.
Experiments on standard benchmarks show that our approach achieves state-of-the-art accuracy, with strong resilience to matching ambiguities and wide disparity ranges.
Code will be released.
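The abstract does not spell out the disparity parameterization. A common probability-based formulation in stereo networks is a soft-argmax over discrete disparity candidates, which keeps the output differentiable and bounded; the sketch below illustrates that general idea only. The function name `soft_disparity` and the tensor layout are our assumptions, not DispViT's actual implementation.

```python
import torch
import torch.nn.functional as F

def soft_disparity(logits: torch.Tensor) -> torch.Tensor:
    """Probability-based disparity readout (soft-argmax).

    logits: [B, D, H, W] scores over D candidate disparities per pixel.
    Returns a continuous disparity map [B, H, W] as the expectation of
    the per-pixel probability distribution, which tends to train more
    stably than direct unconstrained regression.
    """
    probs = F.softmax(logits, dim=1)                      # [B, D, H, W]
    d = torch.arange(logits.shape[1], dtype=probs.dtype,
                     device=probs.device).view(1, -1, 1, 1)
    return (probs * d).sum(dim=1)                         # [B, H, W]
```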
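The shift-embedding mechanism is likewise only described at a high level. One plausible reading, sketched below under our own assumptions (the function name `shift_embed`, the candidate shift set, and the zero-padding convention are all hypothetical), is to horizontally shift the right-view features by each candidate disparity and stack the shifted copies into separate channel groups, so that large displacements remain visible to the tokenizer.

```python
import torch

def shift_embed(right_feat: torch.Tensor,
                shifts=(0, 8, 16, 32, 64)) -> torch.Tensor:
    """Hypothetical shift-embedding: encode each candidate disparity
    shift into its own channel group.

    right_feat: [B, C, H, W] right-view feature map.
    Returns [B, C * len(shifts), H, W], where channel group i holds the
    features shifted right by shifts[i], aligning them with the left
    view under that disparity hypothesis.
    """
    groups = []
    for s in shifts:
        shifted = torch.roll(right_feat, shifts=s, dims=-1)
        if s > 0:
            shifted[..., :s] = 0.0  # discard wrapped-around columns
        groups.append(shifted)
    return torch.cat(groups, dim=1)
```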
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16420