Multi-Level CLIP Transfer for Open-Vocabulary Object Detection

Yingfan Chen; Meina Kan; Hao Liang; Zheng Qingfang; Shiguang Shan

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0

Keywords: open-vocabulary object detection

TL;DR: We propose the Multi-level CLIP Transfer strategy to leverage the complementary strengths of CLIP and the detector for open-vocabulary object detection.

Abstract: Open-vocabulary object detection (OVOD) aims to detect novel objects beyond the training categories. Recent approaches extend conventional detectors to open-vocabulary detectors by combining detector scores with the zero-shot classification scores of pre-trained vision-language models such as CLIP, which can identify a wide range of visual concepts from language descriptions. However, such a simple score-level combination struggles to balance the localization and classification of novel objects: CLIP encodes global semantics for accurate classification but is largely insensitive to localization precision when scoring proposals, whereas the detector provides robust localization yet tends to misclassify novel objects as background. Rather than trading one off against the other, we aim to leverage the complementary strengths of CLIP and the detector. To this end, we propose the Multi-level CLIP Transfer (MCT-Det) strategy, which transfers context, alignment, and generalization knowledge from CLIP to the detector at three distinct levels. Specifically, for each region proposal: 1) At the feature level, we refine region features by dynamically integrating CLIP's global context via cross-attention to improve localization. 2) At the embedding level, we integrate the region representations of CLIP and the detector into a unified embedding that couples image-text alignment with localization awareness for reliable recognition. 3) At the score level, we follow previous methods and exploit CLIP's zero-shot classification ability via a score combination strategy. Built upon F-ViT, MCT-Det achieves comprehensive improvements and outperforms state-of-the-art methods, with 52.9 AP$_{50}^{\text{novel}}$ on OV-COCO and 39.8 mAP$_r$ on OV-LVIS using ViT-L/14.
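
The three levels described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give the exact operators, so every function name below is hypothetical, the cross-attention omits learned projections, the embedding fusion is assumed to be a weighted average, and the score combination is assumed to be the weighted geometric mean used by earlier frozen-VLM detectors, with illustrative weights `alpha` and `beta`.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def cross_attend(region_feat, clip_tokens):
    """Feature level (assumed form): the region feature acts as the query and
    attends over CLIP context tokens; the attended context is added back as a
    residual. Learned query/key/value projections are omitted for brevity."""
    scale = math.sqrt(len(region_feat))
    weights = softmax([dot(region_feat, t) / scale for t in clip_tokens])
    context = [
        sum(w * t[i] for w, t in zip(weights, clip_tokens))
        for i in range(len(region_feat))
    ]
    return [r + c for r, c in zip(region_feat, context)]


def fuse_embeddings(det_emb, clip_emb, weight=0.5):
    """Embedding level (assumed form): weighted average of the normalized
    detector and CLIP region embeddings, re-normalized to unit length."""
    det_n, clip_n = l2_normalize(det_emb), l2_normalize(clip_emb)
    mixed = [(1 - weight) * d + weight * c for d, c in zip(det_n, clip_n)]
    return l2_normalize(mixed)


def combine_scores(det_score, clip_score, is_novel, alpha=0.35, beta=0.65):
    """Score level (assumed form): weighted geometric mean of detector and
    CLIP scores, leaning more on CLIP for novel categories. The values of
    alpha and beta are illustrative, not taken from the paper."""
    w = beta if is_novel else alpha
    return det_score ** (1 - w) * clip_score ** w


# Toy usage with 2-D vectors and hand-picked scores.
refined = cross_attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
unified = fuse_embeddings(refined, [0.6, 0.8])
novel_score = combine_scores(0.2, 0.9, is_novel=True)
```

In this toy setting, a novel-category proposal with a low detector score and a high CLIP score receives a higher combined score than it would under the base-category weighting, which mirrors the motivation for relying on CLIP when the detector tends to label novel objects as background.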

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 11958
