Multi-modal Contextual Prompt Learning for Multi-label Classification with Partial Labels

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · ICMLC 2024 · CC BY-SA 4.0
Abstract: Multi-label classification is a task with diverse applications, but current algorithms rely heavily on accurately labeled data, making data collection time-consuming and labor-intensive. Multi-label classification with partial labels eases this annotation burden but presents significant challenges of its own. In this study, we propose Multi-modal Contextual Prompt Learning (MCPL), a novel approach that leverages large-scale vision-language models and exploits the strong image-text alignment in CLIP to address the scarcity of label annotations. The vision-language model's encoders are pre-trained on a large corpus of image-text pairs. We introduce multi-modal contextual prompt learning on both the image and label-text sides to better exploit the image-label correspondence within CLIP, improving multi-label classification performance even when only partial labels are available. We further use a coupling function to link the two modalities, establishing an interactive connection between the two sets of modal prompts. Extensive experiments on the MS-COCO and VOC2007 datasets demonstrate the superiority of MCPL, which achieves competitive performance.
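The abstract does not specify the form of the coupling function. The sketch below assumes a MaPLe-style design in which visual prompts are obtained by linearly projecting learnable text-side context tokens; the class name `MultiModalPromptCoupler`, the dimensions, and the choice of a linear projection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiModalPromptCoupler(nn.Module):
    """Hypothetical sketch of coupled multi-modal prompts.

    Learnable context tokens live in the text-encoder embedding space;
    a coupling function (here, a linear projection) maps them into the
    visual-encoder embedding space so the two prompt sets stay linked
    and can be optimized jointly.
    """

    def __init__(self, n_ctx: int = 4, txt_dim: int = 512, vis_dim: int = 768):
        super().__init__()
        # Learnable text-side context tokens, shared across all labels.
        self.text_prompts = nn.Parameter(torch.randn(n_ctx, txt_dim) * 0.02)
        # Coupling function: projects text prompts into visual prompt space.
        self.couple = nn.Linear(txt_dim, vis_dim)

    def forward(self):
        # Visual prompts are derived from the text prompts, so gradients
        # through the coupling keep the two modalities aligned.
        visual_prompts = self.couple(self.text_prompts)  # (n_ctx, vis_dim)
        return self.text_prompts, visual_prompts


coupler = MultiModalPromptCoupler()
txt_p, vis_p = coupler()
print(txt_p.shape, vis_p.shape)  # torch.Size([4, 512]) torch.Size([4, 768])
```

In use, the text-side prompts would be prepended to each label name's token embeddings before CLIP's text encoder, and the projected visual prompts prepended to the patch embeddings before the image encoder, so that both prompt sets adapt together to the partially labeled classification task.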