Keywords: Depth Estimation, Fine-Tuning, Diffusion Pretraining
TL;DR: Some current diffusion-based depth estimation models are flawed, and simple end-to-end fine-tuning of Stable Diffusion outperforms much more complicated baselines.
Abstract: Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. We show that the inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configurations while being more than 200× faster. Furthermore, we show that end-to-end fine-tuning with task-specific losses enables deterministic single-step inference, outperforming previous diffusion-based depth and normal estimation models on common zero-shot benchmarks. This fine-tuning scheme works similarly well on Stable Diffusion directly.
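The abstract names "task-specific losses" for end-to-end fine-tuning without specifying them. A common choice for zero-shot monocular depth supervision is an affine-invariant loss, which aligns the prediction to ground truth via a least-squares scale and shift before measuring error. The sketch below is an illustrative assumption, not the paper's exact loss; the function name and the L1 error metric are hypothetical.

```python
import numpy as np

def affine_invariant_l1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Scale-and-shift-invariant L1 loss (a sketch, not the paper's exact loss).

    Solves min_{s,t} ||s * pred + t - gt||_2 in closed form via least squares,
    then reports the mean absolute error of the aligned prediction.
    """
    pred, gt = pred.ravel(), gt.ravel()
    # Design matrix [pred, 1] so lstsq recovers scale s and shift t.
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return float(np.mean(np.abs(s * pred + t - gt)))

# A prediction that differs from ground truth only by scale and shift
# incurs (numerically) zero loss:
print(affine_invariant_l1(np.array([1.0, 2.0, 3.0]),
                          np.array([5.0, 7.0, 9.0])))  # ~0.0
```

Such a loss lets the fine-tuned model predict depth only up to an affine transform, which is what makes zero-shot evaluation across datasets with different depth ranges meaningful.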
Submission Number: 47