Keywords: Stereo Matching, Generalization, Data Augmentation, Training Strategy
TL;DR: We explore a structure-grounded training design whose three strategies directly improve the generalization of RNN-based stereo matching models using only a limited amount of synthetic stereo data.
Abstract: Stereo matching networks can suffer from generalization challenges when trained on synthetic data and deployed in real-world settings.
While existing methods rely on fine-tuning or pre-trained vision foundation models for cross-domain robustness, we revisit this gap from a training perspective and explore a structure-grounded training design that directly improves generalization of RNN-based stereo matching models using only a limited amount of synthetic stereo data, without changing the network architecture or adding any inference overhead.
Specifically, we target all three main modules of a typical stereo matching pipeline: in cost volume construction, we enhance geometric cues through data augmentation; in context encoding, we strengthen semantic guidance via auxiliary multi-task context supervision; in recurrent disparity refinement, we regulate update dynamics with depth-update regularization.
Experiments on multiple mainstream architectures and diverse real-world datasets suggest consistent gains in robustness, improving RAFT-Stereo by 6.6% on KITTI 2015, IGEV-Stereo by 13.7% on Middlebury, and DLNR by 55.4% on ETH3D.
These insights reveal the previously overlooked importance of structure-grounded training design for achieving reliable stereo depth estimation under data-scarce, domain-shifted conditions.
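Illustrative sketch (not the authors' implementation): the abstract does not give the loss formulation, so the following is only a minimal guess at how the two loss-side strategies, auxiliary context supervision and depth-update regularization, might be attached to an iterative-refinement loss without touching the network. Everything here is an assumption made for illustration: the function names, the exponentially weighted L1 sequence loss, the squared-update penalty, and the weights `gamma`, `lambda_aux`, and `lambda_reg`. Disparity maps are flattened to plain lists to keep the example dependency-free.

```python
def sequence_loss(preds, gt, gamma=0.9):
    """Exponentially weighted L1 loss over iterative disparity predictions.

    preds: list of per-iteration predictions, each a flat list of disparities.
    gt: flat list of ground-truth disparities.
    Later iterations receive larger weights (gamma is an assumed value).
    """
    n = len(preds)
    total = 0.0
    for i, pred in enumerate(preds):
        weight = gamma ** (n - 1 - i)
        l1 = sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)
        total += weight * l1
    return total


def update_regularizer(preds):
    """Hypothetical regularizer: mean squared size of per-iteration updates.

    Penalizes large jumps between consecutive refinement steps. The paper's
    actual depth-update regularization may be defined differently.
    """
    if len(preds) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(preds, preds[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur)) / len(cur)
    return total / (len(preds) - 1)


def total_loss(preds, gt, aux_loss, lambda_aux=0.1, lambda_reg=0.01):
    """Combine the main loss with the two auxiliary terms.

    aux_loss: scalar from a hypothetical auxiliary head on the context
    encoder (used only during training, so inference cost is unchanged).
    """
    return (sequence_loss(preds, gt)
            + lambda_aux * aux_loss
            + lambda_reg * update_regularizer(preds))


if __name__ == "__main__":
    # Two refinement iterations over a two-pixel "image".
    preds = [[1.0, 2.0], [2.0, 3.0]]
    gt = [2.0, 3.0]
    print(total_loss(preds, gt, aux_loss=0.5))
```

Because both extra terms act only on training-time quantities, a design of this kind would leave the architecture and inference path unchanged, which matches the constraint stated in the abstract.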
Supplementary Material: pdf
Submission Number: 95