RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

ICLR 2026 Conference Submission 5166 Authors

14 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Real-Time Object Detection, Neural Architecture Search, Transfer Learning
TL;DR: We present RF-DETR, a real-time object detector that achieves Pareto-optimal accuracy-latency tradeoffs using Neural Architecture Search.
Abstract: Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavyweight vision-language model (VLM) for new domains, we introduce RF-DETR, a lightweight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without re-training. Further, we revisit the "tunable knobs" for NAS to improve the transferability of DETRs to diverse target domains. Our proposed approach outperforms prior state-of-the-art methods at all latencies on COCO and Roboflow100-VL. Notably, RF-DETR (medium) approaches performance parity with GroundingDINO (tiny) on Roboflow100-VL while running 60x faster, and RF-DETR (nano) achieves 48.0 AP on COCO, improving upon D-FINE (nano) by 5.3 AP.
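To illustrate the high-level recipe described in the abstract (evaluate many configurations sliced from one fine-tuned base network without re-training, then keep the Pareto-optimal ones), here is a minimal sketch. It is not the authors' implementation: the "knobs" (depth, width, resolution), the evaluate_config() stub, and all numbers are hypothetical placeholders standing in for real latency and AP measurements on a target dataset.

```python
# Illustrative sketch (not the RF-DETR codebase): selecting Pareto-optimal
# detector configurations from a weight-shared search space without
# re-training each candidate. All knobs and numbers are placeholders.
import itertools
import random


def evaluate_config(depth: int, width: float, resolution: int) -> tuple[float, float]:
    """Placeholder for evaluating one sub-network sliced from a fine-tuned
    base network; returns (latency_ms, accuracy_AP). A real evaluation would
    run validation images through the sliced sub-network and time it."""
    latency = 2.0 * depth * width * (resolution / 640) ** 2
    accuracy = 40.0 + 8.0 * (1 - 1 / (depth * width)) + random.uniform(-0.5, 0.5)
    return latency, accuracy


def pareto_front(points):
    """Keep configurations not strictly dominated by another
    (i.e., no other config has both lower latency and higher AP)."""
    front = []
    for cfg, (lat, ap) in points:
        dominated = any(
            l <= lat and a >= ap and (l, a) != (lat, ap)
            for _, (l, a) in points
        )
        if not dominated:
            front.append((cfg, (lat, ap)))
    return sorted(front, key=lambda item: item[1][0])


# Enumerate a small grid over hypothetical "tunable knobs".
search_space = itertools.product([4, 6, 8, 12], [0.5, 0.75, 1.0], [384, 512, 640])
results = [((d, w, r), evaluate_config(d, w, r)) for d, w, r in search_space]

for cfg, (lat, ap) in pareto_front(results):
    print(f"depth={cfg[0]:2d} width={cfg[1]:.2f} res={cfg[2]} -> {lat:6.1f} ms, {ap:.1f} AP")
```

Because every candidate reuses the fine-tuned base network's weights, the cost of this loop is evaluation only, which is what makes sweeping thousands of configurations per target dataset feasible.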
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5166