Abstract: Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising formulation and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rain and snow. In this paper, we propose a novel robust depth estimation method, named D4RD, featuring a customized contrastive learning mode tailored to diffusion models, to resist performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the `trinity' contrastive scheme. This scheme takes the sampled noise of the forward diffusion process as a natural reference and guides the predicted noise in different scenes toward a more stable and precise optimum. Meanwhile, we further extend the noise-level trinity to the more generic feature and image levels, building a multi-level contrast that distributes the burden of robust perception across the overall network. Moreover, before handling complex scenarios, we enhance the stability of the baseline diffusion model with three simple but effective improvements, which facilitate convergence and suppress depth outliers.
Extensive experiments show that D4RD surpasses existing state-of-the-art (SoTA) solutions on both synthetic corruption datasets and real-world weather conditions. The code will be made available.
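For illustration only, below is a minimal PyTorch-style sketch of how a noise-level `trinity' contrast could be organized, with the sampled forward-diffusion noise serving as the shared reference that both clean and weather-corrupted predictions are pulled toward. All names (e.g., `denoiser`, `trinity_noise_loss`) and the exact loss composition are hypothetical assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of a noise-level "trinity" contrast for a diffusion
# depth model: the sampled forward-diffusion noise acts as the common
# reference that guides noise predictions from clean and corrupted inputs
# toward the same target. Names and weighting are illustrative only.
import torch
import torch.nn.functional as F


def trinity_noise_loss(denoiser, depth, clean_img, corrupt_img, t, alpha_bar):
    """denoiser(x_t, cond, t) -> predicted noise; alpha_bar: cumulative alphas."""
    noise = torch.randn_like(depth)                   # sampled reference noise
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * depth + (1 - a).sqrt() * noise   # forward diffusion of the depth map

    eps_clean = denoiser(x_t, clean_img, t)           # prediction conditioned on the clean scene
    eps_corrupt = denoiser(x_t, corrupt_img, t)       # prediction conditioned on the adverse scene

    # Both predictions are contrasted against the same sampled noise, so the
    # reference behaves like a distilled teacher target shared across scenes.
    loss_clean = F.mse_loss(eps_clean, noise)
    loss_corrupt = F.mse_loss(eps_corrupt, noise)
    loss_consist = F.mse_loss(eps_corrupt, eps_clean.detach())
    return loss_clean + loss_corrupt + loss_consist
```

In this sketch, the extension to feature and image levels described in the abstract would apply the same clean/corrupted/reference pattern to intermediate features and decoded depth maps, but those details are not specified here.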
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: Accurate depth perception of complex scenes contributes to the development of multimedia technology.
Supplementary Material: zip
Submission Number: 2846