Not All Tokens Matter All The Time: Dynamic Token Aggregation Towards Efficient Detection Transformers
TL;DR: We introduce Dynamic DETR, a novel approach that adaptively controls token density according to the tokens' importance distribution and applies multi-level sparsification.
Abstract: The substantial computational demands of detection transformers (DETRs) hinder their deployment in resource-constrained scenarios, with the encoder consistently emerging as a critical bottleneck. A promising solution lies in reducing token redundancy within the encoder. However, existing methods perform static sparsification and ignore how the importance of tokens for object detection varies across pyramid levels and encoder blocks, leading to suboptimal sparsification and performance degradation. In this paper, we propose **Dynamic DETR** (**Dynamic** token aggregation for **DE**tection **TR**ansformers), a novel strategy that leverages the inherent importance distribution to control token density and performs multi-level token sparsification. Within each stage, we apply a proximal aggregation paradigm to low-level tokens to maintain spatial integrity, and a holistic strategy to high-level tokens to capture broader contextual information. Furthermore, we propose center-distance regularization to align the distribution of tokens throughout the sparsification process, thereby facilitating representation consistency and effectively preserving critical object-specific patterns. Extensive experiments on canonical DETR models demonstrate that Dynamic DETR is broadly applicable across various models and consistently outperforms existing token sparsification methods.
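To make the abstract's core idea concrete, below is a minimal, self-contained sketch of importance-aware token aggregation in PyTorch. It is not the authors' implementation: the importance proxy (feature norm), the similarity-based merging, the `keep_ratio` parameter, and the centroid-matching reading of center-distance regularization are all assumptions introduced here for illustration only.

```python
# Hypothetical sketch of importance-aware token aggregation (not the paper's code).
import torch
import torch.nn.functional as F


def aggregate_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Merge low-importance tokens into their most similar retained token.

    tokens: (B, N, C) encoder tokens of one pyramid level.
    keep_ratio: fraction of tokens retained after sparsification (assumed knob;
    the paper instead derives the density from an importance distribution).
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Importance proxy: L2 norm of each token feature (an assumption).
    importance = tokens.norm(dim=-1)                          # (B, N)
    keep_idx = importance.topk(n_keep, dim=1).indices         # (B, n_keep)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # Assign every token to its most similar kept token (cosine similarity),
    # then average each group so no information is simply discarded.
    sim = F.normalize(tokens, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)
    assign = sim.argmax(dim=-1)                                # (B, N)

    merged = torch.zeros_like(kept)
    counts = torch.zeros(B, n_keep, 1, device=tokens.device)
    merged.scatter_add_(1, assign.unsqueeze(-1).expand(-1, -1, C), tokens)
    counts.scatter_add_(1, assign.unsqueeze(-1), torch.ones(B, N, 1, device=tokens.device))
    return merged / counts.clamp_min(1.0)                      # (B, n_keep, C)


def center_distance_loss(original: torch.Tensor, sparsified: torch.Tensor) -> torch.Tensor:
    """One plausible reading of center-distance regularization: keep the feature
    centroid of the sparsified token set close to that of the original set."""
    return F.mse_loss(sparsified.mean(dim=1), original.mean(dim=1))


if __name__ == "__main__":
    x = torch.randn(2, 100, 256)           # e.g. tokens from one feature-pyramid level
    y = aggregate_tokens(x, keep_ratio=0.5)
    print(y.shape, center_distance_loss(x, y).item())   # torch.Size([2, 50, 256]) ...
```

In this reading, a "proximal" variant would restrict the assignment to spatially neighboring kept tokens at low levels, while the "holistic" variant shown here lets high-level tokens merge across the whole level.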
Lay Summary: Detection Transformers are the detection community's response to the rise of transformers. Despite their strong performance, this paradigm demands substantial computational resources and is frequently criticized for its slow inference speed, particularly in low-resource scenarios. We set out to make these models more efficient by reducing the number of image pieces (known as *tokens*) that the system's encoder has to process. While past methods remove tokens in a fixed way, we found that not all tokens are equally important at all stages of the model. Ignoring this leads to unnecessary information loss and worse detection results.
Our approach, called Dynamic DETR, learns to keep more important tokens and discard less useful ones dynamically, adjusting across pyramid levels and stages. It also groups similar tokens together in meaningful ways and ensures the model keeps a consistent understanding of where objects are. This makes object detection models faster and lighter while preserving as much of their accuracy as possible, helping bring cutting-edge AI to real-world deployment.
Primary Area: Applications->Computer Vision
Keywords: detection transformer; efficient model; token merging
Submission Number: 3175