MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

22 Sept 2022 (modified: 12 Mar 2024) · ICLR 2023 Conference Withdrawn Submission
Keywords: Monocular 3D object detection, detection transformer, multi-view 3D object detection
Abstract: Monocular 3D object detection has long been a challenging task in autonomous driving, since it requires decoding 3D predictions solely from a single 2D image. Most existing methods follow conventional 2D object detectors: they first localize objects by their centers and then predict 3D attributes from neighboring features. However, such local visual features alone are insufficient to capture scene-level 3D spatial structure and ignore long-range inter-object depth relations. In this paper, we introduce a novel framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR. We modify the vanilla transformer to be depth-aware and guide the whole detection process with contextual depth cues. Specifically, alongside the visual encoder that captures object appearance, we introduce a dedicated depth encoder that produces non-local depth embeddings encoding scene-level geometric information. We then represent 3D object candidates as a set of queries and propose a depth-guided decoder with depth cross-attention modules, which conducts both inter-object and object-scene depth feature interactions. In this way, each object query adaptively estimates its 3D attributes from depth-guided regions of the image and is no longer constrained to local visual features. On the KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance without requiring extra dense depth annotations. In addition, our depth-guided transformer extends to 3D object detection from multi-view images and shows superior performance on the nuScenes dataset. Extensive ablation studies demonstrate the effectiveness of our approach.
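To make the depth-guided decoding described above more concrete, below is a minimal PyTorch sketch of a decoder layer in which object queries perform self-attention (inter-object interaction) followed by cross-attention to depth embeddings and to visual embeddings (object-scene interaction). The module names, layer ordering, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a depth-guided decoder layer (assumed structure, not the authors' code).
import torch
import torch.nn as nn


class DepthGuidedDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        # Inter-object interaction among the 3D object queries.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Object-scene interaction with the depth encoder output (depth cross-attention).
        self.depth_cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Standard cross-attention to the visual encoder output.
        self.visual_cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, queries, depth_embed, visual_embed):
        # queries:      (B, num_queries, d_model) -- 3D object candidates
        # depth_embed:  (B, H*W, d_model)         -- scene-level depth embeddings
        # visual_embed: (B, H*W, d_model)         -- appearance embeddings
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.depth_cross_attn(q, depth_embed, depth_embed)[0])
        q = self.norms[2](q + self.visual_cross_attn(q, visual_embed, visual_embed)[0])
        return self.norms[3](q + self.ffn(q))


# Usage: 50 object queries gather depth-guided context from a 40x40 feature map.
layer = DepthGuidedDecoderLayer()
q = torch.randn(2, 50, 256)
depth, visual = torch.randn(2, 1600, 256), torch.randn(2, 1600, 256)
out = layer(q, depth, visual)  # (2, 50, 256); fed to 3D attribute prediction heads
```

The key point of the sketch is that each query aggregates scene-level depth cues through the depth cross-attention step before (or alongside) visual cross-attention, rather than relying only on local features around an object center.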
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2203.13310/code)