Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Published: 01 Jan 2024 · Last Modified: 15 Apr 2025 · IEEE Trans. Image Process. 2024 · License: CC BY-SA 4.0
Abstract: We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual context, the visual context, and the cross-modal correspondence between texts and regions, thereby automatically attending strongly to the corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, which explicitly supervises a student detector with the context-aware attention of masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances fine-grained, region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance for the detector's concept-region matching, further improving OVD performance. Extensive experiments on various detection datasets demonstrate the effectiveness of our multi-modal context learning strategy.
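The abstract does not spell out the exact form of the distillation objective, so the following is only a minimal illustrative sketch of the general idea: transferring the teacher's masked-concept-word attention over image regions to the student's concept-region matching scores. The function name `attention_distillation_loss`, the tensor shapes, and the use of a temperature-scaled KL divergence are all assumptions for illustration, not the paper's stated formulation.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(teacher_attn, student_scores, temperature=1.0):
    """Hypothetical contextual-knowledge distillation loss (not the paper's exact objective).

    teacher_attn:   (num_masked_words, num_regions) cross-attention weights that
                    masked concept-word tokens place on region features in the
                    teacher fusion transformer.
    student_scores: (num_masked_words, num_regions) concept-region matching
                    logits produced by the student detector for the same words.
    """
    # Normalize both quantities into distributions over candidate regions.
    teacher_p = F.softmax(teacher_attn / temperature, dim=-1)
    student_logp = F.log_softmax(student_scores / temperature, dim=-1)
    # KL divergence pushes the student's concept-region matching toward the
    # teacher's context-aware attention over regions.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * temperature ** 2

# Toy usage with random tensors standing in for real model outputs.
t_attn = torch.rand(4, 100)      # 4 masked concept words, 100 candidate regions
s_scores = torch.randn(4, 100)
loss = attention_distillation_loss(t_attn, s_scores, temperature=2.0)
```

Any soft-matching objective (e.g., an L2 or cross-entropy variant) could play the same role; the key point from the abstract is that the supervision signal comes from the teacher's context-aware attention rather than from box annotations for novel classes.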