Keywords: Depth, Depth Completion, Vision Transformer, Computer Vision
Abstract: This paper proposes a joint convolutional attention and Transformer block, which deeply couples the convolutional layer and Vision Transformer into one block, as the basic unit for constructing our depth completion model in a pyramidal structure. This hybrid structure naturally benefits from both the local connectivity of convolutions and the global context of the Transformer within a single model. As a result, our CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, while achieving significantly higher efficiency (nearly 1/3 the FLOPs) compared to pure Transformer-based methods. Especially when the captured depth is highly sparse, the performance gap over other methods widens considerably.
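The abstract describes a block that couples a convolutional layer with a Vision Transformer so that local connectivity and global context coexist in one unit. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the authors' implementation: the class name `JointConvAttnBlock`, the channel counts, and the fusion-by-concatenation choice are all assumptions made for demonstration.

```python
import torch
import torch.nn as nn


class JointConvAttnBlock(nn.Module):
    """Illustrative sketch of a joint convolution + self-attention block.

    Assumptions (not from the paper): a 3x3 conv branch for locality,
    multi-head self-attention over flattened spatial tokens for global
    context, and a 1x1 conv to fuse the two branches with a residual add.
    """

    def __init__(self, channels: int = 32, num_heads: int = 4):
        super().__init__()
        # Local branch: 3x3 convolution preserves spatial locality.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Global branch: self-attention over all H*W spatial tokens.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fuse local and global features back to the input channel count.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv(x)
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        global_feat, _ = self.attn(tokens, tokens, tokens)  # global context
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        # Residual connection around the fused two-branch output.
        return x + self.fuse(torch.cat([local, global_feat], dim=1))
```

Stacking such blocks at multiple resolutions would give the pyramidal structure the abstract mentions; each stage would see both convolutional and attention features at its own scale.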
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Community Implementations: [ 1 code implementation](https://www.catalyzex.com/paper/completionformer-depth-completion-with/code)