One-Step Video Depth Estimation via Self-Distillation

Published: 05 Mar 2026, Last Modified: 09 Mar 2026, ICLR 2026 Workshop RSI Poster, CC BY 4.0
Keywords: Video depth estimation, diffusion model, self-distillation
TL;DR: We tackle the self-improvement challenge of video depth estimation by distilling iterative diffusion models into a high-fidelity, one-step framework that is up to 20× faster.
Abstract: Diffusion-based video depth estimation methods have recently set new benchmarks by leveraging rich generative priors learned from video synthesis, delivering exceptional depth accuracy and robust temporal consistency. However, the iterative nature of these models creates a computational bottleneck, hindering their utility in autonomous or dynamic environments that require real-time adaptation. To bridge this gap, we frame the efficiency-accuracy trade-off as a self-improvement challenge and propose a two-stage self-distillation strategy. In the first stage, we distill a multi-step diffusion model into a one-step student by applying latent-space distillation to the U-Net via score matching and latent gradient matching. In the second stage, we further distill the decoder using feature-alignment and pixel-wise distillation losses. Our method achieves depth accuracy comparable to state-of-the-art multi-step video depth models while reducing denoising time by up to 3× and decoding time by up to 20×.
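
The abstract names the objectives of each stage but, being an abstract, gives no implementation detail. Below is a minimal PyTorch sketch of how the two stages' losses could be assembled; every name, signature, and weight here (student_unet, return_features, lambda_grad, lambda_feat) is an illustrative assumption, not the authors' code.

import torch
import torch.nn.functional as F

def stage1_latent_distillation_loss(student_unet, teacher_unet,
                                    z_noisy, t, cond, lambda_grad=1.0):
    # Stage 1 (sketch): distill the multi-step teacher U-Net into a
    # one-step student, entirely in latent space.
    # - Score matching: the student's noise prediction should match the
    #   teacher's at the same noisy latent and timestep.
    # - Latent gradient matching: spatial finite-difference gradients of
    #   the predictions should also agree, preserving sharp structure.
    with torch.no_grad():
        eps_teacher = teacher_unet(z_noisy, t, cond)  # frozen teacher
    eps_student = student_unet(z_noisy, t, cond)

    loss_score = F.mse_loss(eps_student, eps_teacher)

    def spatial_grads(x):
        # Finite differences along the last two (spatial) dimensions.
        dx = x[..., :, 1:] - x[..., :, :-1]
        dy = x[..., 1:, :] - x[..., :-1, :]
        return dx, dy

    sdx, sdy = spatial_grads(eps_student)
    tdx, tdy = spatial_grads(eps_teacher)
    loss_grad = F.mse_loss(sdx, tdx) + F.mse_loss(sdy, tdy)

    return loss_score + lambda_grad * loss_grad

def stage2_decoder_distillation_loss(student_decoder, teacher_decoder,
                                     z, lambda_feat=1.0):
    # Stage 2 (sketch): distill the latent decoder.
    # - Pixel-wise distillation: match the decoded depth maps.
    # - Feature alignment: match intermediate decoder features (assumes a
    #   hypothetical return_features flag exposing them on both decoders).
    with torch.no_grad():
        depth_t, feats_t = teacher_decoder(z, return_features=True)
    depth_s, feats_s = student_decoder(z, return_features=True)

    loss_pix = F.l1_loss(depth_s, depth_t)
    loss_feat = sum(F.mse_loss(fs, ft) for fs, ft in zip(feats_s, feats_t))
    return loss_pix + lambda_feat * loss_feat

# Toy smoke test with stand-in networks (shapes: batch, channels, H, W):
if __name__ == "__main__":
    z = torch.randn(2, 4, 32, 32)
    t = torch.tensor([999, 999])
    toy_unet = lambda z, t, c: torch.tanh(z)  # stand-in for a video U-Net
    toy_dec = lambda z, return_features=True: (z.mean(1, keepdim=True), [z])
    print(stage1_latent_distillation_loss(toy_unet, toy_unet, z, t, cond=None))
    print(stage2_decoder_distillation_loss(toy_dec, toy_dec, z))

In such a setup the one-step student would typically be initialized from the teacher's weights, a common choice in diffusion distillation, and the teacher is wrapped in torch.no_grad so that only the student receives gradients.
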
Submission Number: 70