Structure-grounded Training Strategies Aid Generalization in Stereo Matching

Published: 05 Nov 2025, Last Modified: 30 Jan 2026
Venue: 3DV 2026 Poster
License: CC BY 4.0
Keywords: Stereo Matching, Generalization, Data Augmentation, Training Strategy
TL;DR: We explore a structure-grounded training design, built from three strategies, that directly improves the generalization of RNN-based stereo matching models using only a limited amount of synthetic stereo data.
Abstract: Stereo matching networks can suffer from generalization challenges when trained on synthetic data and deployed in real-world settings. While existing methods rely on fine-tuning or pre-trained vision foundation models for cross-domain robustness, we revisit this gap from a training perspective and explore a structure-grounded training design that directly improves generalization of RNN-based stereo matching models using only a limited amount of synthetic stereo data, without changing the network architecture or adding any inference overhead. Specifically, we target all three main modules of a typical stereo matching pipeline: in cost volume construction, we enhance geometric cues through data augmentation; in context encoding, we strengthen semantic guidance via auxiliary multi-task context supervision; in recurrent disparity refinement, we regulate update dynamics with depth-update regularization. Experiments on multiple mainstream architectures and diverse real-world datasets suggest consistent gains in robustness, improving RAFT-Stereo by 6.6% on KITTI 2015, IGEV-Stereo by 13.7% on Middlebury, and DLNR by 55.4% on ETH3D. These insights reveal the previously overlooked importance of structure-grounded training design for achieving reliable stereo depth estimation under data-scarce, domain-shifted conditions.
Supplementary Material: pdf
Submission Number: 95
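
As a rough illustration of how the training signals named in the abstract might be combined into one objective, the sketch below uses NumPy. It is not the authors' formulation. The exponentially weighted L1 sequence loss follows the form commonly used to train RAFT-style iterative models; the function `update_regularizer`, the scalar `aux_loss`, and the weights `lambda_aux` and `lambda_reg` are hypothetical stand-ins for the auxiliary context supervision and depth-update regularization, whose actual definitions are not given here.

```python
import numpy as np

def sequence_loss(preds, gt, gamma=0.9):
    """Exponentially weighted L1 loss over iterative disparity predictions.

    Later iterations receive larger weights (the last one has weight 1).
    """
    n = len(preds)
    total = 0.0
    for i, pred in enumerate(preds):
        weight = gamma ** (n - 1 - i)
        total += weight * float(np.mean(np.abs(pred - gt)))
    return total

def update_regularizer(preds):
    """Hypothetical penalty: mean absolute change between successive iterations."""
    deltas = [float(np.mean(np.abs(b - a))) for a, b in zip(preds[:-1], preds[1:])]
    return float(np.mean(deltas)) if deltas else 0.0

def total_loss(preds, gt, aux_loss=0.0, lambda_aux=0.1, lambda_reg=0.01):
    """Assumed combination: sequence loss + weighted auxiliary and update terms."""
    return (sequence_loss(preds, gt)
            + lambda_aux * aux_loss
            + lambda_reg * update_regularizer(preds))

# Toy example: two refinement iterations on a 2x2 disparity map.
preds = [np.zeros((2, 2)), np.ones((2, 2))]
gt = np.ones((2, 2))
loss = total_loss(preds, gt, aux_loss=2.0)
```

In this sketch all three terms act only at training time, which is consistent with the abstract's claim that the architecture and inference cost are unchanged.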