Keywords: Cross Resolution, Encoding and Decoding, DETR, Detection
TL;DR: Computationally Efficient High-Resolution DETR
Abstract: Detection Transformers (DETR) are renowned object detection pipelines; however, computationally efficient multiscale detection with DETR remains challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection at the speed of low-resolution detection. CRED is based on two modules: a Cross-Resolution Attention Module (CRAM) and One-Step Multiscale Attention (OSMA). CRAM transfers the knowledge of the low-resolution encoder output to a high-resolution feature, while OSMA fuses multiscale features in a single step, producing a feature map of the desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to that of the high-resolution DETR counterpart with roughly 50% fewer FLOPs. Specifically, the state-of-the-art DN-DETR, when combined with CRED (referred to as CRED-DETR), becomes 76% faster, with ∼50% fewer FLOPs than its high-resolution counterpart (202 GFLOPs) on the MS-COCO benchmark. We plan to release pretrained CRED-DETRs for use by the community.
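As a rough illustration of the cross-resolution idea described in the abstract, the minimal sketch below assumes CRAM can be approximated by standard multi-head cross-attention in which high-resolution feature tokens query the low-resolution encoder output. The class and variable names (`CRAMSketch`, `hi_res`, `lo_res`) are hypothetical; the paper's actual CRAM and OSMA designs may differ.

```python
# A minimal sketch of cross-resolution attention, NOT the paper's implementation:
# high-resolution tokens (queries) attend to the low-resolution encoder output
# (keys/values), transferring low-resolution knowledge to the high-resolution map.
import torch
import torch.nn as nn

class CRAMSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hi_res: torch.Tensor, lo_res: torch.Tensor) -> torch.Tensor:
        # hi_res: (B, H*W, C) high-resolution feature tokens (queries)
        # lo_res: (B, h*w, C) low-resolution encoder output (keys/values)
        out, _ = self.attn(query=hi_res, key=lo_res, value=lo_res)
        return self.norm(hi_res + out)  # residual connection + normalization

# Usage: enrich a 100x100 high-resolution map with a 25x25 encoder output.
cram = CRAMSketch(dim=256)
hi = torch.randn(2, 100 * 100, 256)
lo = torch.randn(2, 25 * 25, 256)
enriched = cram(hi, lo)  # shape: (2, 10000, 256)
```

Because attention cost scales with the query and key lengths rather than requiring a high-resolution encoder throughout, a design along these lines can keep most of the pipeline at low resolution, which is consistent with the FLOP savings the abstract reports.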
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5803