Keywords: free-object prediction, query-free, detecting ambiguity, location-deduplication decoder
Abstract: Transformer-based detectors, such as DETR and DINO, suffer from a specific limitation: they can detect only a fixed number of objects, determined by the predefined number of queries. This leads to missed detections when a scene exceeds the model's capacity and to more false positives when it contains fewer objects. In addition, existing approaches often combine one-to-one and one-to-many label assignment in the decoder to accelerate training and convergence. However, this combination introduces a detecting-ambiguity issue that those methods largely overlook. To address these challenges, we propose QFree-Det, a novel query-free detector that dynamically detects a variable number of objects across different input images. Specifically, we present an Adaptive Free Query Selection (AFQS) algorithm that dynamically selects queries from the encoder tokens, resolving the fixed-capacity issue. We then propose a sequential matching method that decouples the one-to-one and one-to-many processes into separate sequential steps, effectively eliminating the detecting ambiguity. To realize sequential matching, we design a new Location-Deduplication Decoder (LDD) by rethinking the roles of cross-attention (CA) and self-attention (SA) within the decoder: LDD first regresses the locations of multiple boxes with CA in a one-to-many manner and then performs classification with SA in a one-to-one manner to recognize and eliminate duplicate boxes. Finally, to improve detection of small objects, we design a unified PoCoo loss that leverages prior knowledge of box size to encourage the model to pay more attention to small objects. Extensive experiments on the COCO2017 and WiderPerson datasets demonstrate the effectiveness of QFree-Det. For instance, it achieves consistent, notable improvements over DINO across five different backbones.
Notably, QFree-Det obtains a new state-of-the-art of 54.4% AP and 38.8% APs on COCO val2017 with a VMamba-T backbone under the 1× training schedule (12 epochs), surpassing DINO-VMamba-T by +0.9% AP and +2.2% APs. The source code will be released upon acceptance.
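To illustrate the core idea behind adaptive query selection, the sketch below selects a variable number of decoder queries by thresholding per-token objectness scores from the encoder. This is only a minimal illustration of the concept, not the paper's AFQS algorithm: the function name, the threshold `tau`, the `min_queries` fallback, and the score source are all assumptions for exposition.

```python
import numpy as np

def adaptive_query_selection(tokens, scores, tau=0.5, min_queries=1):
    """Illustrative sketch: pick a variable number of queries per image.

    tokens : (N, d) array of encoder token features.
    scores : (N,) per-token objectness scores in [0, 1].

    Tokens whose score exceeds `tau` become decoder queries, so the
    number of queries adapts to scene content instead of being fixed.
    """
    keep = scores > tau
    if keep.sum() < min_queries:
        # Fall back to the top-scoring tokens so the decoder always
        # receives at least `min_queries` inputs.
        keep = np.zeros_like(keep)
        keep[np.argsort(scores)[-min_queries:]] = True
    return tokens[keep], np.flatnonzero(keep)

# A crowded scene yields more queries than a sparse one:
tokens = np.random.randn(6, 256)
scores = np.array([0.9, 0.1, 0.7, 0.2, 0.95, 0.3])
queries, idx = adaptive_query_selection(tokens, scores, tau=0.5)
```

Here three tokens (indices 0, 2, 4) pass the threshold, so the decoder would run with three queries for this image; an emptier scene would pass fewer.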
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4457