OptiRSDE: A Novel Approach with Temporal Smoothing and Optimized Feature Matching for Fast and Robust Depth Estimation

ICLR 2026 Conference Submission18986 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Stereo vision, depth estimation, real-time systems, temporal smoothing, feature matching, object detection
TL;DR: OptiRSDE is a fast, robust stereo depth estimation pipeline that achieves high long-range accuracy using temporal smoothing and optimized feature matching.
Abstract: Depth estimation accuracy over long ranges is a core problem in robotics, maritime autonomy, terrestrial autonomy, and environmental monitoring, where accurate scene understanding is crucial for safe and informed decision-making. Existing monocular solutions suffer a sharp accuracy drop beyond mid-range, with errors of 10–25% at 50–100 m. Recent deep learning–based stereo networks (e.g., FoundationStereo, DSMNet, MonSter, RAFT-Stereo, CREStereo, Selective-Stereo) achieve impressive results on benchmarks but struggle in real-world extended-range scenarios—frequently collapsing at 20–30 m and beyond, where predictions deviate by factors of 2–3x and object-level depth is often lost. In contrast, a calibrated high-quality stereo system can deliver accurate long-range estimates but at the expense of high computational overhead. We introduce OptiRSDE (Optimized Robust Stereo Depth Estimation), a lightweight yet robust classical computer vision pipeline that integrates disparity refinement, temporal smoothing, and QR-code–based synchronization. OptiRSDE achieves <3% error at 50 m and 5–10% at 100 m, substantially outperforming both monocular methods and modern deep learning stereo baselines in real-world conditions. Operating at 5 FPS, while requiring only standard chessboard calibration and YOLO-based object detection for deployment. Temporal smoothing and outlier rejection mitigate depth jitter, producing stable long-range depth at object level. Validated on DrivingStereo and a custom 1080p stereo dataset, our system demonstrates scalable, real-time, extended-range stereo depth estimation—delivering strong generalization where both monocular and state-of-the-art deep learning methods fail.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18986
Loading