Keywords: Multi-Scale, Sparse, Transformer, Object Detection
TL;DR: We introduce a novel perspective-aware strategy that improves sparse DETR-like detectors, achieving 52.0% AP on the COCO dataset while reducing computational cost (FLOPs) by 13%.
Abstract: DETR has achieved notable performance improvements in object detection by leveraging the long-range modeling capability of Transformers, but encoding all tokens indiscriminately significantly escalates computational cost and slows convergence. Recent sparsification strategies reduce computational cost through sparse encoders; however, these methods rely heavily on a fixed sparse ratio, which overlooks the coherence of feature representations across levels and leads to performance degradation in complex scenes. To address this issue, we propose a novel object detection approach that constructs consistent representations of multi-level features. The approach comprises two steps. First, we introduce a perspective proposal module that leverages the spatial information of high-level foreground features to guide the sparse sampling of low-level features, ensuring both the integrity and the coherence of multi-scale feature information. Second, we integrate semantic probabilities to hierarchically and dynamically adjust query saliency, thereby refining the semantic interaction among foreground queries. Experimental results demonstrate that on the challenging VisDrone dataset, our pDETR method improves AP by 1.8% over DINO. On the COCO 2017 dataset, the improvement is even more pronounced, with a +2.5% gain in AP: pDETR attains 51.5% AP under the 1× schedule and 52.0% AP under the 2× schedule. Moreover, it converges faster, exceeding 40% AP in just 2 training epochs, while reducing computational cost by 13% in FLOPs, indicating superior detection capability.
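To make the first step concrete, below is a minimal PyTorch sketch of perspective-guided sparse token selection as the abstract describes it: foreground saliency predicted on the highest (coarsest) feature level is projected down to guide which tokens are kept at each lower level, so the retained tokens stay spatially coherent across scales. This is not the authors' implementation; the names feats, score_head, and keep_ratio are hypothetical, and the actual module in the paper may differ in its details.

```python
import torch
import torch.nn.functional as F

def perspective_guided_token_selection(feats, score_head, keep_ratio=0.3):
    """Sketch of perspective-aware sparse sampling (assumed interface).

    feats:      list of multi-scale maps [B, C, H_l, W_l], ordered fine -> coarse.
    score_head: any module mapping C channels to a 1-channel foreground score.
    Returns one boolean keep-mask [B, 1, H_l, W_l] per level.
    """
    coarse = feats[-1]
    # Foreground saliency estimated on the coarsest (most semantic) level.
    score = score_head(coarse)  # [B, 1, H, W]

    keep_masks = []
    for f in feats:
        b, _, h, w = f.shape
        # Project the coarse-level saliency onto this level's resolution, so
        # low-level sampling follows the high-level foreground layout.
        s = F.interpolate(score, size=(h, w), mode="bilinear", align_corners=False)
        s = s.flatten(2)  # [B, 1, h*w]
        k = max(1, int(keep_ratio * h * w))
        topk = s.topk(k, dim=-1).indices  # indices of the tokens to keep
        mask = torch.zeros(b, 1, h * w, dtype=torch.bool, device=f.device)
        mask.scatter_(-1, topk, True)
        keep_masks.append(mask.view(b, 1, h, w))
    return keep_masks
```

Because every level's mask is derived from the same high-level saliency map rather than a per-level fixed ratio applied independently, the selected foreground regions line up across scales, which is the cross-level coherence the abstract argues fixed-ratio sparsification loses.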
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7363