Abstract: Diffusion-based monocular depth estimation models demonstrate strong performance with limited supervision by leveraging pre-trained text-to-image models. However, their multi-step inference process and large model size create prohibitive computational overhead for practical applications.
To retain the data efficiency of diffusion models while addressing their inference inefficiency, we propose a framework that enhances diffusion-based depth estimation through a two-stage training approach. The first stage distills implicit depth knowledge in the latent space by leveraging the rich representations of pre-trained diffusion models. The second stage refines explicit depth predictions in pixel space using a Hybrid Depth Loss that combines a Shift-Scale Invariant (SSI) loss for global structure preservation with an Edge-aware Gradient Huber loss for fine-grained detail enhancement. Both components are adaptively weighted by a dynamic task weighting strategy that balances structural consistency against boundary precision.
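For intuition, a minimal sketch of what such a hybrid objective could look like is given below. It is not the authors' implementation: the tensor shapes, the per-image least-squares alignment for the SSI term, the exponential edge weights, and the uncertainty-style learnable task weights (`log_w_ssi`, `log_w_grad`) are all assumptions chosen to illustrate the combination of the two loss terms.

```python
# Hypothetical sketch of a Hybrid Depth Loss (SSI + edge-aware gradient Huber),
# not the authors' code; shapes assumed: pred/gt/mask (B,1,H,W), image (B,3,H,W).
import torch
import torch.nn.functional as F


def ssi_loss(pred, gt, mask):
    """Shift-Scale Invariant loss: align pred to gt with a per-image
    least-squares scale/shift, then take a masked L1 error."""
    pred, gt, mask = pred.flatten(1), gt.flatten(1), mask.flatten(1).float()
    n = mask.sum(dim=1).clamp(min=1)
    mean_p = (pred * mask).sum(dim=1) / n
    mean_g = (gt * mask).sum(dim=1) / n
    var_p = (((pred - mean_p[:, None]) ** 2) * mask).sum(dim=1) / n
    cov = ((pred - mean_p[:, None]) * (gt - mean_g[:, None]) * mask).sum(dim=1) / n
    s = cov / var_p.clamp(min=1e-6)          # closed-form scale
    t = mean_g - s * mean_p                  # closed-form shift
    aligned = s[:, None] * pred + t[:, None]
    return ((aligned - gt).abs() * mask).sum() / mask.sum().clamp(min=1)


def edge_aware_gradient_huber_loss(pred, gt, image, delta=0.1):
    """Huber penalty on depth-gradient differences, down-weighted where the
    RGB image has strong edges (one plausible edge-aware weighting)."""
    def grad(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

    pdx, pdy = grad(pred)
    gdx, gdy = grad(gt)
    idx, idy = grad(image.mean(dim=1, keepdim=True))  # luminance gradients
    wx, wy = torch.exp(-idx.abs()), torch.exp(-idy.abs())
    return (F.huber_loss(pdx * wx, gdx * wx, delta=delta)
            + F.huber_loss(pdy * wy, gdy * wy, delta=delta))


def hybrid_depth_loss(pred, gt, image, mask, log_w_ssi, log_w_grad):
    """Combine both terms with learnable uncertainty-style task weights,
    one possible form of the dynamic weighting mentioned in the abstract."""
    l_ssi = ssi_loss(pred, gt, mask)
    l_grad = edge_aware_gradient_huber_loss(pred, gt, image)
    return (torch.exp(-log_w_ssi) * l_ssi + log_w_ssi
            + torch.exp(-log_w_grad) * l_grad + log_w_grad)
```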
Specifically, we demonstrate that our two-stage distillation approach yields D$^3$epth, an efficient variant that achieves state-of-the-art results while significantly reducing computational requirements. In parallel, our base model D$^2$epth, trained with the enhanced pixel-space depth loss, also surpasses the state of the art across various benchmarks.
Overall, these results deliver the accuracy benefits of diffusion-based methods at an efficiency comparable to traditional data-driven approaches.
Submission Number: 35