Keywords: Depth Prediction, Generative Model, Diffusion Models
Abstract: Existing video depth estimation faces a fundamental trade-off: *generative models* often suffer from geometric hallucinations (e.g., inconsistent geometric structures) and scale drift, while *discriminative models* demand massive labeled datasets to resolve semantic ambiguities (e.g., misinterpreting textures or semantic boundaries as geometric structures). To mitigate this trade-off, we present **DVD**, which is, to the best of our knowledge, the *first* framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. **DVD** introduces three targeted adaptation strategies to ground generative priors: (**i**) fixing the diffusion **timestep as a structural anchor** to balance global stability against high-frequency detail; (**ii**) applying **latent manifold rectification (LMR)**, a simple parameter-free strategy that counteracts regression-induced *mean collapse*, a critical issue that erases geometric information; and (**iii**) uncovering and leveraging the model's inherent **global affine coherence** for a straightforward, overlap-based alignment strategy that enables seamless long-video inference. Extensive experiments demonstrate that **DVD** achieves *state-of-the-art* zero-shot performance across standard video benchmarks. Crucially, by inheriting the rich world priors of video foundation models, this deterministic adaptation paradigm proves highly data-efficient, requiring only 367K task-specific training frames. We will fully release our pipeline to benefit the open-source community.
Submission Number: 182
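
The overlap-based alignment mentioned in item (**iii**) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the estimator, so the least-squares fit below is an assumption, as are the function names `fit_affine` and `stitch_windows` and the premise that each window's prediction differs from its neighbour by a single global scale and shift.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares scale s and shift t such that s * src + t approximates dst."""
    x = src.ravel()
    y = dst.ravel()
    # Design matrix [x, 1] for the linear model y = s * x + t.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)

def stitch_windows(windows, overlap):
    """Align each window to the previous one on their shared frames, then concatenate.

    windows: list of arrays shaped (frames, H, W); consecutive windows share
             `overlap` frames (an assumed layout, not taken from the paper).
    """
    out = [windows[0]]
    prev = windows[0]
    for w in windows[1:]:
        # Fit one global scale/shift on the overlapping frames only.
        s, t = fit_affine(w[:overlap], prev[-overlap:])
        aligned = s * w + t
        # Keep only the frames not already covered by the previous window.
        out.append(aligned[overlap:])
        prev = aligned
    return np.concatenate(out, axis=0)
```

For example, if a second window equals the true depth after an unknown scale and shift, `stitch_windows([w0, w1], overlap=2)` recovers a single consistent sequence, provided the overlapping frames are not constant-valued (otherwise the fit is degenerate).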