D$^3$epth: Distilling Diffusion Models For Efficient Depth Estimation Through A Two-Stage Approach

Published: 01 Sept 2025, Last Modified: 18 Nov 2025 · ACML 2025 Conference Track · CC BY 4.0
Abstract: Diffusion-based monocular depth estimation models achieve strong performance with limited supervision by leveraging pre-trained text-to-image models. However, their multi-step inference and large model size impose prohibitive computational overhead for practical applications. To retain the data efficiency of diffusion models while addressing their inference inefficiency, we propose a framework that enhances diffusion-based depth estimation through a two-stage training approach. The first stage distills implicit depth knowledge in the latent space by leveraging the rich representations of pre-trained diffusion models. The second stage refines explicit depth predictions in pixel space using a Hybrid Depth Loss that combines a Shift-Scale Invariant (SSI) loss for global structure preservation with an Edge-aware Gradient Huber loss for fine-grained detail enhancement. Both components are adaptively balanced by a dynamic task weighting strategy that trades off structural consistency against boundary precision. We demonstrate that our two-stage distillation approach yields D$^3$epth, an efficient variant that achieves state-of-the-art results while significantly reducing computational requirements. In parallel, our base model D$^2$epth, trained with the enhanced pixel-space depth loss, also surpasses state-of-the-art methods across various benchmarks. Overall, these results deliver the accuracy benefits of diffusion-based methods at the efficiency level of traditional data-driven approaches.
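
The abstract does not give the exact formulations, so the following is a minimal PyTorch sketch of one plausible reading of the Hybrid Depth Loss: an SSI term that aligns the prediction to the target with a per-image least-squares scale and shift, a Huber loss on depth gradients up-weighted at image edges, and uncertainty-based task weighting (Kendall et al., 2018) as a hypothetical stand-in for the paper's dynamic weighting strategy. All function names, shapes, and constants here are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def ssi_loss(pred, target):
    """Shift-Scale Invariant loss (assumed form): solve the closed-form
    least-squares scale s and shift c per image, then compare with L1."""
    b = pred.shape[0]
    p, t = pred.reshape(b, -1), target.reshape(b, -1)
    p_c = p - p.mean(1, keepdim=True)
    t_c = t - t.mean(1, keepdim=True)
    s = (p_c * t_c).sum(1, keepdim=True) / (p_c.pow(2).sum(1, keepdim=True) + 1e-8)
    c = t.mean(1, keepdim=True) - s * p.mean(1, keepdim=True)
    return F.l1_loss(s * p + c, t)


def edge_aware_gradient_huber(pred, target, image, delta=0.1):
    """Huber loss on depth gradients, up-weighted where the RGB image has
    strong edges -- one plausible reading of 'edge-aware'."""
    def dxy(x):  # finite differences along width and height
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

    pgx, pgy = dxy(pred)
    tgx, tgy = dxy(target)
    igx, igy = dxy(image.mean(1))             # grayscale image gradients
    wx, wy = 1.0 + igx.abs(), 1.0 + igy.abs()  # emphasize image edges (assumed)
    lx = F.huber_loss(pgx, tgx, reduction="none", delta=delta)
    ly = F.huber_loss(pgy, tgy, reduction="none", delta=delta)
    return (wx * lx).mean() + (wy * ly).mean()


class HybridDepthLoss(torch.nn.Module):
    """Combines both terms with learnable homoscedastic-uncertainty weights,
    a common dynamic task-weighting scheme used here only as an illustration."""

    def __init__(self):
        super().__init__()
        self.log_vars = torch.nn.Parameter(torch.zeros(2))

    def forward(self, pred, target, image):
        # pred, target: (B, H, W) depth maps; image: (B, 3, H, W) RGB input
        losses = torch.stack([
            ssi_loss(pred, target),
            edge_aware_gradient_huber(pred, target, image),
        ])
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()
```

Under these assumptions the weights adapt during training: as one term's effective uncertainty grows, its contribution is down-weighted, which matches the abstract's stated balance between global structural consistency and boundary precision.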
Submission Number: 35