04 Oct 2024 · CC BY 4.0
Self-supervised multi-frame depth estimation predicts depth by leveraging geometric cues from multiple input frames. Traditional methods construct cost volumes based on epipolar geometry to explicitly integrate the geometric information from these input frames. Although this approach can be effective, the epipolar-based cost volume has two key limitations: (1) it assumes a static environment, and (2) it requires pose information during inference. As a result, this cost volume fails in real-world scenarios where dynamic objects and image noise are common and pose information is unavailable. In this paper, we demonstrate that the cross-attention map can function as a full cost volume that addresses these limitations. Specifically, we find that training the cross-attention layers for image reconstruction enables them to implicitly learn a warping function within the cross-attention, resembling the explicit epipolar warping used in traditional self-supervised depth estimation methods. Building on this insight, we propose the CRoss-Attention map and Feature aggregaTor (CRAFT), which is designed to effectively leverage the matching information in the cross-attention map by aggregating and refining the full cost volume. Additionally, we apply CRAFT in a hierarchical manner to progressively improve depth predictions through a coarse-to-fine approach. Thorough evaluations on the KITTI and Cityscapes datasets demonstrate that our approach outperforms traditional methods. In contrast to previous methods that employ epipolar-based cost volumes, which often struggle in regions with dynamic objects and image noise, our method remains robust and provides accurate depth predictions under these challenging conditions.
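To make the core idea concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the contrast between an epipolar cost volume and a cross-attention map: an epipolar cost volume scores the current pixel only against pose-derived candidate locations, whereas a cross-attention map scores it against *all* reference-frame pixels, requiring no pose. The feature shapes, scaling, and function names here are illustrative assumptions; CRAFT additionally aggregates and refines this map, which is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_full_cost_volume(feat_cur, feat_ref):
    """Cross-attention map as a full (all-pairs) cost volume.

    feat_cur: (N, C) features of current-frame pixels (queries).
    feat_ref: (M, C) features of reference-frame pixels (keys).
    Returns an (N, M) map: row i is a matching distribution over
    ALL reference pixels for current pixel i -- no pose needed.
    (In practice learned projections W_q, W_k precede this step.)
    """
    c = feat_cur.shape[-1]
    scores = feat_cur @ feat_ref.T / np.sqrt(c)  # scaled dot-product
    return softmax(scores, axis=-1)

# An epipolar cost volume, by contrast, would score each current pixel
# only against D depth-hypothesis samples warped via the known pose:
# cost[i, d] = <feat_cur[i], feat_ref[warp(i, depth_d, pose)]>,
# which is undefined without pose and unreliable for moving objects.

rng = np.random.default_rng(0)
attn = cross_attention_full_cost_volume(rng.standard_normal((4, 8)),
                                        rng.standard_normal((6, 8)))
```

Each row of `attn` sums to one, so the map can be read as soft correspondences between frames, which is the matching signal CRAFT aggregates.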