Cost Volume Meets Prompt: Enhancing MVS with Prompts for Autonomous Driving

19 Sept 2025 (modified: 14 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: multi-view stereo, depth estimation
Abstract: Metric depth is foundational for perception, prediction, and planning in autonomous driving. Recent zero-shot metric depth foundation models still exhibit substantial distortions over large depth ranges and under diverse illumination. Multi-view stereo (MVS) offers geometric consistency, but it fails where parallax is weak or texture is absent. Conversely, using sparse LiDAR points directly as per-view prompts introduces noise and gaps due to occlusion, sparsity, and projection misalignment. To address these challenges, we introduce Prompt-MVS, a cross-view prompt-enhanced framework for metric depth estimation. Our key insight is to inject LiDAR-derived prompts into cost volume construction through a differentiable, matching-aware fusion module, so that the model can leverage accurate metric cues while preserving the dense geometric consistency provided by MVS. Furthermore, we propose depth-spatial alternating attention (DSAA), which combines spatial information with depth context and significantly improves multi-view geometric consistency. Experiments on KITTI, DDAD, and NYUv2 demonstrate the effectiveness of Prompt-MVS, which outperforms state-of-the-art methods by up to 34.6% in scale consistency. Notably, our method remains effective even when prompts are missing or highly sparse, and it produces stable metric depth under severe occlusion, weak texture, and long-range scenes, demonstrating strong robustness and generalization. Our code will be publicly available.
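The abstract does not specify how the fusion module is built, so the following is only a minimal illustrative sketch of the general idea of biasing an MVS cost volume with sparse metric depth prompts. It is not the authors' implementation. The function names `inject_depth_prompts` and `soft_argmin_depth`, the Gaussian-shaped bonus, and the parameters `sigma` and `weight` are all assumptions made for illustration. The operations are smooth, so the same logic could be ported to an autodiff framework.

```python
import numpy as np

def inject_depth_prompts(cost_volume, sparse_depth, depth_hypotheses,
                         sigma=1.0, weight=10.0):
    """Lower the matching cost near each sparse metric depth prompt.

    Hypothetical helper, not from the paper.
    cost_volume:      (D, H, W) matching costs, lower means a better match
    sparse_depth:     (H, W) projected LiDAR depth, 0 where no prompt exists
    depth_hypotheses: (D,) metric depth of each cost-volume slice
    """
    valid = sparse_depth > 0  # pixels that actually carry a prompt
    # Distance between every depth hypothesis and the prompt at each pixel.
    diff = depth_hypotheses[:, None, None] - sparse_depth[None, :, :]
    # Gaussian-shaped bonus (an assumed form): largest at the prompted depth.
    bonus = weight * np.exp(-0.5 * (diff / sigma) ** 2)
    # Pixels without a prompt keep their original matching cost.
    return cost_volume - bonus * valid[None, :, :]

def soft_argmin_depth(cost_volume, depth_hypotheses):
    """Regress depth as the softmax-weighted mean of the hypotheses."""
    logits = -cost_volume
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    return (prob * depth_hypotheses[:, None, None]).sum(axis=0)

# Toy usage: an uninformative cost volume with a single prompt at pixel (0, 0).
hypotheses = np.linspace(1.0, 20.0, 20)
costs = np.zeros((20, 2, 2))
prompts = np.zeros((2, 2))
prompts[0, 0] = 10.0
fused = inject_depth_prompts(costs, prompts, hypotheses)
depth = soft_argmin_depth(fused, hypotheses)
```

In this toy case the prompted pixel is pulled toward its metric depth, while unprompted pixels fall back to whatever the matching costs alone support, which mirrors the abstract's claim that the method should remain usable when prompts are missing.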
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16275