Revisiting Emergent Correspondence from Transformers for Self-supervised Multi-frame Depth Estimation

26 Sept 2024 (modified: 18 Nov 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0
Keywords: Self-supervised Depth Estimation; Multi-frame Depth Estimation
Abstract: Self-supervised multi-frame depth estimation predicts depth by leveraging geometric cues from multiple input frames. Traditional methods construct cost volumes based on epipolar geometry to explicitly integrate the geometric information from these frames. Although effective in many settings, the epipolar-based cost volume has two key limitations: (1) it assumes a static environment, and (2) it requires pose information during inference. As a result, it fails in real-world scenarios where dynamic objects and image noise are common and pose information is unavailable. In this paper, we demonstrate that the cross-attention map can function as a full cost volume that addresses these limitations. Specifically, we find that training cross-attention layers for image reconstruction leads them to implicitly learn a warping function, resembling the explicit epipolar warping used in traditional self-supervised depth estimation methods. Building on this observation, we propose the CRoss-Attention map and Feature aggregaTor (CRAFT), which effectively leverages the matching information in the cross-attention map by aggregating and refining the full cost volume. In addition, we apply CRAFT hierarchically to progressively refine depth predictions in a coarse-to-fine manner. Thorough evaluations on the KITTI and Cityscapes datasets show that our approach outperforms traditional methods. Whereas previous methods built on epipolar-based cost volumes often struggle in regions with dynamic objects and image noise, our method remains robust and produces accurate depth predictions under these challenging conditions.
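The following is a minimal PyTorch sketch of the core idea described in the abstract: a cross-attention map between the features of two frames serves as a full, pose-free cost volume, and attention-weighted aggregation acts as an implicit warp of the source frame toward the target frame. The module name, single-head design, 1x1-convolution projections, and dimensions are illustrative assumptions, not the actual CRAFT architecture from the paper.

```python
import torch
import torch.nn as nn


class CrossAttentionCostVolume(nn.Module):
    """Illustrative sketch (not the paper's exact design): one cross-attention
    layer whose attention map plays the role of a full cost volume between a
    target frame and a source frame, with no pose required at inference."""

    def __init__(self, channels: int, dim_qk: int = 64):
        super().__init__()
        # Hypothetical 1x1-conv projections to queries, keys, and values.
        self.to_q = nn.Conv2d(channels, dim_qk, kernel_size=1)
        self.to_k = nn.Conv2d(channels, dim_qk, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = dim_qk ** -0.5

    def forward(self, feat_tgt: torch.Tensor, feat_src: torch.Tensor):
        # feat_tgt, feat_src: (B, C, H, W) features of the target/source frames.
        B, C, H, W = feat_tgt.shape
        q = self.to_q(feat_tgt).flatten(2).transpose(1, 2)  # (B, HW, dim_qk)
        k = self.to_k(feat_src).flatten(2).transpose(1, 2)  # (B, HW, dim_qk)
        v = self.to_v(feat_src).flatten(2).transpose(1, 2)  # (B, HW, C)

        # All-pairs similarity between target and source pixels: a "full" cost
        # volume that is not restricted to epipolar lines.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, HW, HW)

        # Attention-weighted aggregation of source features, i.e. an implicit
        # warp of the source frame toward the target frame.
        warped = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return attn, warped
```

Under this view, the softmaxed (HW x HW) attention matrix gives every target pixel a matching distribution over all source pixels, which is exactly what an unconstrained cost volume encodes; restricting the comparison to epipolar lines would recover the traditional pose-dependent construction.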
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6507