Abstract: We address the problem of estimating a high-quality dense depth map from a single RGB input image. We first analyze the Conditional Random Field (CRF) in combination with transformers and exploit the multi-head attention mechanism to compute the potential function. We then propose spatial-window CRFs and channel-wise CRFs to capture information in the spatial and channel dimensions, and fuse them with a two-way fusion module, yielding Dual Aggregation CRFs (DCRFs). Finally, the multi-scale features aggregated by the DCRFs are used for internal scene clustering via slot attention to obtain the depth map. We call our method MSD-CRFs. Experiments demonstrate that our method improves performance across all metrics on KITTI and outperforms current SOTA results on the main ranking metric, Abs Rel, on NYU Depth-v2. Further, we explore the model's generalization capability via zero-shot testing.
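The abstract describes two attention branches, spatial-window and channel-wise, merged by a two-way fusion module. Below is a minimal sketch (not the authors' code) of that dual-aggregation idea, assuming non-overlapping window self-attention for the spatial branch, a C x C attention over flattened positions for the channel branch, and a learned sigmoid gate as the fusion; the module name, window size, and gating scheme are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualAggregationSketch(nn.Module):
    """Illustrative only: spatial-window attention plus channel-wise
    attention, blended by a learned two-way gate (assumed fusion)."""
    def __init__(self, dim: int, num_heads: int = 4, window: int = 7):
        super().__init__()
        self.window = window
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)  # produces the two-way fusion gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W assumed divisible by the window size.
        B, C, H, W = x.shape
        w = self.window

        # Spatial branch: multi-head self-attention inside local windows.
        t = x.permute(0, 2, 3, 1)                               # (B, H, W, C)
        t = t.view(B, H // w, w, W // w, w, C)
        t = t.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)   # windows as batch
        s, _ = self.spatial_attn(t, t, t)                       # per-window attention
        s = s.view(B, H // w, W // w, w, w, C)
        s = s.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        s = s.permute(0, 3, 1, 2)                               # back to (B, C, H, W)

        # Channel branch: C x C attention over flattened spatial positions.
        f = x.flatten(2)                                        # (B, C, H*W)
        attn = torch.softmax(f @ f.transpose(1, 2) / f.shape[-1] ** 0.5, dim=-1)
        c = (attn @ f).view(B, C, H, W)

        # Two-way fusion: a learned gate blends the two branches.
        g = torch.sigmoid(self.gate(torch.cat([s, c], dim=1).permute(0, 2, 3, 1)))
        g = g.permute(0, 3, 1, 2)
        return g * s + (1 - g) * c

feat = torch.randn(1, 32, 28, 28)
out = DualAggregationSketch(dim=32)(feat)   # -> (1, 32, 28, 28)
```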