Adapting CLIP for DETR-based Object Detection
Abstract: Object detection involves both class identification and spatial localization. While DETR-based architectures have shown promising detection capabilities by framing the task as set prediction, prior approaches apply only limited refinement to object features, leading to a weaker inherent understanding of objects, particularly when generalizing to unseen categories. To address this, we propose CLIP-DETR, a novel detection framework that harnesses the pretrained visual-linguistic capabilities of CLIP to enhance both the encoding and decoding processes in DETR models. Our method is built on two key principles: 1) feature-map sensitivity to objects, and 2) query adaptability. Extensive experiments demonstrate that CLIP-DETR significantly outperforms state-of-the-art models on object detection, open-vocabulary object detection, and zero-shot object recognition tasks, illustrating its superior generalization and recognition abilities.
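To make the high-level idea concrete, below is a minimal sketch, not the authors' released code, of one plausible way CLIP features could condition a DETR-style decoder as the abstract describes: object queries cross-attend to projected (frozen) CLIP tokens in addition to the encoder feature map. All module and parameter names here (`ClipConditionedDecoderLayer`, `clip_proj`, etc.) are illustrative assumptions, not names from the paper.

```python
import torch
import torch.nn as nn

class ClipConditionedDecoderLayer(nn.Module):
    """One DETR-style decoder layer whose object queries additionally
    cross-attend to CLIP tokens (hypothetical fusion scheme)."""
    def __init__(self, d_model=256, n_heads=8, clip_dim=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Project CLIP embeddings into the detector's embedding space.
        self.clip_proj = nn.Linear(clip_dim, d_model)
        self.clip_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, queries, memory, clip_tokens):
        # queries:     (B, num_queries, d_model)  object queries
        # memory:      (B, HW, d_model)           encoder feature map
        # clip_tokens: (B, T, clip_dim)           frozen CLIP visual/text tokens
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.cross_attn(q, memory, memory)[0])
        c = self.clip_proj(clip_tokens)
        q = self.norms[2](q + self.clip_attn(q, c, c)[0])  # CLIP conditioning
        return self.norms[3](q + self.ffn(q))

# Toy usage with random tensors standing in for real features.
layer = ClipConditionedDecoderLayer()
queries = torch.randn(2, 100, 256)    # 100 object queries
memory = torch.randn(2, 49, 256)      # 7x7 grid of encoder tokens
clip_tokens = torch.randn(2, 16, 512) # placeholder CLIP embeddings
out = layer(queries, memory, clip_tokens)
print(out.shape)  # torch.Size([2, 100, 256])
```

Keeping the CLIP branch as a separate cross-attention step (rather than concatenating features) would let the pretrained embeddings steer the queries without retraining CLIP itself, which is one natural reading of "query adaptability" in the abstract.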