Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model

Published: 09 Sept 2024 · Last Modified: 11 Sept 2024 · ECCV 2024 Wild3D · CC BY 4.0
Keywords: Depth estimation, computer vision, diffusion
TL;DR: Monocular metric depth estimation using diffusion models with field-of-view conditioning
Abstract: Despite significant progress on domain- and camera-specific models for monocular depth estimation, accurate _metric_ depth estimation for images in the wild remains largely unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity arising from varying camera intrinsics. We propose a generic, task-agnostic diffusion model for monocular metric depth estimation, with several key advancements that enable joint modeling of indoor and outdoor scenes and handle diverse camera intrinsics: a log-scale depth parameterization, synthetic augmentations that generalize beyond the limited camera intrinsics in the training datasets, and a diverse training data mixture that improves generalization. We further show that conditioning on the field-of-view (FOV) alone, rather than the full camera intrinsics, is sufficient to resolve the scale ambiguity. Finally, we show that with an efficient parameterization, inference is remarkably fast, requiring just a few denoising iterations. The resulting method, dubbed DMD (Diffusion for Metric Depth), significantly outperforms recent methods on a diverse set of zero-shot indoor and outdoor benchmarks.
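To make two of the abstract's ingredients concrete, here is a minimal sketch of (a) a log-scale depth parameterization and (b) computing a scalar vertical FOV from camera intrinsics. The depth range bounds and function names are illustrative assumptions, not the paper's exact values; they only show the shape of the idea — depth is mapped to a bounded log-space target for the diffusion model, and a single FOV scalar (rather than the full intrinsics matrix) serves as the conditioning signal.

```python
import numpy as np

def log_depth_param(depth, d_min=0.5, d_max=80.0):
    """Map metric depth to [-1, 1] in log space.

    d_min/d_max are assumed range bounds for illustration; the paper's
    exact normalization constants are not given in the abstract.
    """
    d = np.clip(depth, d_min, d_max)
    return 2.0 * (np.log(d) - np.log(d_min)) / (np.log(d_max) - np.log(d_min)) - 1.0

def vertical_fov(f_y, image_height):
    """Vertical field-of-view in radians from focal length and image height.

    This single scalar is the kind of FOV conditioning signal the abstract
    describes using in place of the full camera intrinsics.
    """
    return 2.0 * np.arctan(image_height / (2.0 * f_y))
```

For example, `log_depth_param(0.5)` maps to `-1.0` and `log_depth_param(80.0)` to `1.0`; a log-space target compresses the long tail of outdoor depths so that indoor and outdoor scenes share a comparable output range.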
Submission Number: 26