QR-DETR: Query Routing for Detection Transformer

Published: 2024 · Last Modified: 28 Jan 2026 · ACCV (6) 2024 · CC BY-SA 4.0
Abstract: Detection Transformer (DETR) predicts object bounding boxes and classes from learned object queries. However, DETR exhibits three major flaws: (1) Only a subset of object queries contribute to the final predictions, leading to inefficient utilization of computational resources. (2) The self-attention and cross-attention layers indiscriminately mix information across object queries without any guidance, potentially hindering effective learning of object representations. (3) At each decoder layer, a query is processed either positively, correctly refining its bounding box and class attributes, or negatively, shifting to predict a different object or erroneously enlarging its bounding box. This suggests that query informativeness is non-uniform, and that indiscriminate inter-query communication could impede the learning of specialized representations for individual queries. To address these concerns, we propose a learnable query routing method that introduces a routing model to identify the object queries requiring processing at each transformer decoder layer. Selected queries pass through the full decoder layer, while the others exit early, and all queries are scattered back into place after processing. This prevents indiscriminate information sharing. Extensive COCO experiments show consistent mAP improvements across various DETR models.
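The select/process/scatter step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the router is stood in for by precomputed per-query scores, a fixed top-k budget `k` is assumed, and the decoder layer is an arbitrary callable.

```python
import numpy as np

def route_queries(queries, scores, k, decoder_layer):
    """Process only the k highest-scoring queries through decoder_layer;
    the remaining queries exit early unchanged, and all queries are
    scattered back to their original positions."""
    # Indices of the top-k queries (hypothetical router output).
    selected = np.argsort(scores)[-k:]
    out = queries.copy()                          # early-exit queries pass through untouched
    out[selected] = decoder_layer(queries[selected])  # gather -> process -> scatter back
    return out

# Toy example: 4 queries of dimension 3; the "decoder layer" doubles features.
queries = np.arange(12, dtype=float).reshape(4, 3)
scores = np.array([0.1, 0.9, 0.3, 0.8])           # router scores (assumed given)
routed = route_queries(queries, scores, k=2, decoder_layer=lambda q: q * 2.0)
```

Here queries 1 and 3 (the two highest-scoring) are transformed, while queries 0 and 2 skip the layer, so no information mixes between the selected and early-exited groups at this layer.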