Keywords: multi-view stereo, depth estimation
Abstract: Metric depth is foundational for perception, prediction, and planning in autonomous driving.
Recent zero-shot metric depth foundation models still exhibit substantial distortions across large depth ranges and under diverse illumination.
While multi-view stereo (MVS) offers geometric consistency, it fails in regions with weak parallax or little texture.
On the other hand, directly using sparse LiDAR points as per-view prompts introduces noise and gaps due to occlusion, sparsity, and projection misalignment.
To address these challenges, we introduce \textbf{Prompt-MVS}, a cross-view prompt-enhanced framework for metric depth estimation.
Our key insight is to inject LiDAR-derived prompts into the cost volume construction process through a differentiable, matching-aware fusion module, enabling the model to leverage accurate metric cues while preserving the dense geometric consistency provided by the MVS process.
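A minimal PyTorch sketch of how such a prompt-guided cost volume fusion could look; the module name `PromptFusion`, the Gaussian prompt encoding, and the gated residual fusion are illustrative assumptions, not necessarily the paper's exact design:

```python
# Sketch: fusing sparse LiDAR prompts into an MVS cost volume.
# Assumed shapes and the gating mechanism are illustrative only.
import torch
import torch.nn as nn

class PromptFusion(nn.Module):
    """Fuse a LiDAR-derived prompt volume into a plane-sweep cost volume."""

    def __init__(self, sigma: float = 1.0):
        super().__init__()
        self.sigma = sigma
        # Learned gate decides, per voxel, how much to trust the prompt.
        self.gate = nn.Sequential(
            nn.Conv3d(2, 8, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(8, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, cost, sparse_depth, depth_bins):
        # cost:         (B, D, H, W) matching cost volume
        # sparse_depth: (B, 1, H, W) projected LiDAR depth, 0 where missing
        # depth_bins:   (D,) depth hypothesis values
        B, D, H, W = cost.shape
        valid = (sparse_depth > 0).float()
        bins = depth_bins.view(1, D, 1, 1)
        # Soft one-hot prompt volume: Gaussian around the LiDAR depth,
        # zeroed where no LiDAR point projects (occlusion/sparsity gaps).
        prompt = torch.exp(-(bins - sparse_depth) ** 2 / (2 * self.sigma ** 2))
        prompt = prompt * valid
        g = self.gate(torch.stack([cost, prompt], dim=1)).squeeze(1)
        # Gated residual fusion keeps the MVS cost where prompts are
        # absent or unreliable, and stays fully differentiable.
        return cost + g * prompt
```

Because the gate is learned and the residual path leaves the original cost untouched, this kind of fusion degrades gracefully when prompts are missing or sparse, consistent with the robustness claim below.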
Furthermore, we propose depth-spatial alternating attention (DSAA), which combines spatial information with depth context, significantly improving multi-view geometric consistency.
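A minimal sketch of alternating attention over the depth and spatial axes of cost-volume features; DSAA's actual internals (head counts, normalization, windowing) are not specified in the abstract, so the layout below is an assumption:

```python
# Sketch: one block that alternates self-attention along the depth axis
# and along the spatial axes of cost-volume features. Illustrative only.
import torch
import torch.nn as nn

class DSAABlock(nn.Module):
    """Attend across depth hypotheses, then across spatial locations."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.depth_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (B, C, D, H, W) cost-volume features
        B, C, D, H, W = x.shape
        # Depth attention: each pixel attends across its D hypotheses.
        t = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, D, C)
        t = self.norm1(t + self.depth_attn(t, t, t, need_weights=False)[0])
        x = t.reshape(B, H, W, D, C).permute(0, 4, 3, 1, 2)
        # Spatial attention: each depth slice attends across H*W locations.
        s = x.permute(0, 2, 3, 4, 1).reshape(B * D, H * W, C)
        s = self.norm2(s + self.spatial_attn(s, s, s, need_weights=False)[0])
        return s.reshape(B, D, H, W, C).permute(0, 4, 1, 2, 3)
```

A practical implementation would likely replace the full spatial attention with a windowed or downsampled variant, since attending over all H*W positions is expensive at typical cost-volume resolutions.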
Experiments on KITTI, DDAD, and NYUv2 demonstrate the effectiveness of Prompt-MVS, which outperforms state-of-the-art methods by up to 34.6\% in scale consistency.
Notably, our method remains effective even with missing or highly sparse prompts and produces stable metric depth under severe occlusion, weak texture, and long-range scenes, demonstrating strong robustness and generalization.
Our code will be made publicly available.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16275