Keywords: Depth Estimation, Fine-Tuning, Diffusion Pretraining
TL;DR: Some current diffusion-based depth estimation models are flawed, and simple end-to-end fine-tuning of Stable Diffusion outperforms much more complicated baselines.
Abstract: Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. We show that the inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configurations while being more than 200× faster. Furthermore, we show that end-to-end fine-tuning with task-specific losses enables deterministic single-step inference, outperforming previous diffusion-based depth and normal estimation models on common zero-shot benchmarks. This fine-tuning scheme works similarly well on Stable Diffusion directly.
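The abstract names "task-specific losses" for end-to-end fine-tuning without specifying them. A common choice for zero-shot monocular depth supervision is an affine-invariant loss, which aligns the prediction to ground truth via a least-squares scale and shift before measuring error. The sketch below is an illustrative assumption, not the paper's exact loss; the function name and the L1 error metric are hypothetical.

```python
import numpy as np

def affine_invariant_l1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Scale-and-shift-invariant L1 loss (a sketch, not the paper's exact loss).

    Solves min_{s,t} ||s * pred + t - gt||_2 in closed form via least squares,
    then reports the mean absolute error of the aligned prediction.
    """
    pred, gt = pred.ravel(), gt.ravel()
    # Design matrix [pred, 1] so lstsq recovers scale s and shift t.
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return float(np.mean(np.abs(s * pred + t - gt)))

# A prediction that differs from ground truth only by scale and shift
# incurs (numerically) zero loss:
print(affine_invariant_l1(np.array([1.0, 2.0, 3.0]),
                          np.array([5.0, 7.0, 9.0])))  # ~0.0
```

Such a loss lets the fine-tuned model predict depth only up to an affine transform, which is what makes zero-shot evaluation across datasets with different depth ranges meaningful.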
Submission Number: 47