HA-Bins: Hierarchical Adaptive Bins for Robust Monocular Depth Estimation Across Multiple Datasets

Published: 01 Jan 2024, Last Modified: 13 Nov 2024. IEEE Trans. Circuits Syst. Video Technol. 2024. License: CC BY-SA 4.0
Abstract: Existing monocular depth estimation methods achieve satisfactory performance on in-the-wild datasets. However, they are usually trained and tested on a single dataset, which makes it difficult for them to generalize to other scenarios. To learn diverse scene priors from multiple datasets, we propose HA-Bins, a hierarchical framework with adaptive bins for robust monocular depth estimation, which consists of two critical components: a group-wise query generator that assigns hierarchical bins and a correlation-aware transformer decoder that generates adaptive bin features. The proposed HA-Bins enjoys several merits. First, the group-wise query generator progressively increases the number of bin queries across multi-scale image features, yielding a hierarchical bin distribution that is robust to diverse scenarios. Second, the correlation-aware transformer decoder refines the correlation between bin queries and image features, effectively improving adaptive image feature aggregation. Visualizations of the query activation maps on the NYUDepthv2 dataset show that the proposed network effectively suppresses depth-irrelevant regions. Experiments on the KITTI, Sintel, and RabbitAI benchmarks show that, without any fine-tuning, our model jointly trained on multiple datasets achieves performance competitive with the state of the art and solid robustness across diverse scenarios. In addition, our method won second place in the Robust Vision Challenge 2022, which targets challenging scenarios with different characteristics.
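The abstract's core mechanism, predicting adaptive depth bins and aggregating them per pixel, follows the general adaptive-bins formulation. Below is a minimal NumPy sketch of that final step only: per-image logits define adaptive bin widths over the depth range, and each pixel's depth is the probability-weighted sum of the resulting bin centers. The function name, shapes, and the `d_min`/`d_max` range are illustrative assumptions, not the paper's actual implementation (which uses hierarchical bin queries and a transformer decoder to produce these quantities).

```python
import numpy as np

def adaptive_bins_depth(bin_logits, pixel_logits, d_min=1e-3, d_max=10.0):
    """Sketch of an adaptive-bins depth head (hypothetical, simplified).

    bin_logits:   (N,) per-image logits defining adaptive bin widths.
    pixel_logits: (H, W, N) per-pixel logits over the N bins.
    Returns an (H, W) depth map: the probability-weighted sum of bin centers.
    """
    # Softmax over bin logits gives normalized bin widths spanning [d_min, d_max].
    w = np.exp(bin_logits - bin_logits.max())
    widths = w / w.sum()
    edges = d_min + (d_max - d_min) * np.concatenate([[0.0], np.cumsum(widths)])
    centers = 0.5 * (edges[:-1] + edges[1:])  # (N,) adaptive bin centers

    # Per-pixel softmax over bins, then expected depth.
    p = np.exp(pixel_logits - pixel_logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return (p * centers).sum(axis=-1)  # (H, W)
```

Because the bin widths are predicted per image, the same number of bins can concentrate resolution near the depths that actually occur in a given scene, which is what makes the distribution adaptive.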