Abstract: Recently, large-scale vision-language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, these methods usually struggle to match text and vision features well, due to the complex semantic gap and the missing labels in multi-label images. To tackle this challenge, we propose Text-Region Matching for optimizing Multi-Label prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. In contrast to existing methods, we advocate exploring the information in category-aware regions rather than in the entire image or individual pixels, which helps bridge the semantic gap between textual and visual representations in a one-to-one matching manner. Concurrently, we introduce multimodal contrastive learning to further narrow the semantic gap between the textual and visual modalities and to establish intra-class and inter-class relationships. Additionally, to deal with missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-2011 benchmark datasets demonstrate that our proposed framework outperforms the current state-of-the-art methods by a significant margin. Our code is available here.
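To make the three ideas in the abstract concrete, the following is a minimal, illustrative PyTorch sketch, not the paper's actual implementation: (1) one-to-one matching between class text embeddings and category-aware region features, (2) a contrastive alignment loss over the matched text-region pairs, and (3) prototype-based similarity for estimating missing labels. All tensor shapes, function names, the masking convention for missing labels, and the similarity threshold are assumptions made purely for illustration.

```python
# Illustrative sketch only; shapes, names, and thresholds are assumptions, not TRM-ML itself.
import torch
import torch.nn.functional as F


def text_region_matching(text_emb, region_emb):
    """For each class, select the region feature most similar to that class's text embedding.

    text_emb:   (C, D) one embedding per class prompt
    region_emb: (B, R, D) R region features per image
    returns:    (B, C, D) best-matching region feature for every class, per image
    """
    text_emb = F.normalize(text_emb, dim=-1)
    region_emb = F.normalize(region_emb, dim=-1)
    sim = torch.einsum("brd,cd->brc", region_emb, text_emb)      # (B, R, C) region-class similarity
    best = sim.argmax(dim=1)                                      # (B, C) index of best region per class
    idx = best.unsqueeze(-1).expand(-1, -1, region_emb.size(-1))
    return torch.gather(region_emb, 1, idx)                       # (B, C, D)


def contrastive_alignment_loss(text_emb, matched_regions, labels, tau=0.07):
    """InfoNCE-style loss pulling each observed class's matched region toward its text embedding.

    labels: (B, C) with 1 = present, 0 = absent, -1 = unknown (missing label); only observed
    positives are supervised here (an assumed convention for this sketch).
    """
    text_emb = F.normalize(text_emb, dim=-1)
    logits = torch.einsum("bcd,kd->bck", matched_regions, text_emb) / tau   # (B, C, C)
    targets = torch.arange(text_emb.size(0), device=logits.device).expand(labels.size(0), -1)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="none")
    mask = (labels == 1).float().flatten()
    return (loss * mask).sum() / mask.sum().clamp(min=1)


def pseudo_labels_from_prototypes(matched_regions, prototypes, threshold=0.7):
    """Estimate unknown labels by comparing matched region features with per-class prototypes."""
    sim = F.cosine_similarity(matched_regions, prototypes.unsqueeze(0), dim=-1)  # (B, C)
    return (sim > threshold).float()
```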
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work advances multimedia/multimodal processing by addressing the critical challenge of matching text and vision features in multi-label image recognition with missing labels. The proposed TRM-ML method leverages text-region matching to optimize multi-label prompt tuning, focusing on category-aware regions for more meaningful cross-modal matching. This approach both bridges the semantic gap between textual and visual representations and improves the accuracy of multi-label image recognition. Furthermore, the incorporation of multimodal contrastive learning narrows the semantic differences between modalities and reinforces intra-class and inter-class relationships. The proposed multimodal category prototype tackles missing labels by estimating unknown labels through intra- and inter-category semantic relationships, leading to improved pseudo-label generation. Extensive experiments on benchmark datasets, showing significant performance improvements over state-of-the-art methods, underline the contribution of this work to multimedia/multimodal processing. The accompanying theoretical analysis elucidates the efficacy of text-region matching, enhancing our understanding of cross-modal interactions. This work not only presents a tangible advancement in multimedia processing techniques but also sets a new benchmark for future research in the field.
Supplementary Material: zip
Submission Number: 479