Mamba-Enhanced Visual-Linguistic Representation for Multi-Label Image Recognition

TMLR Paper6240 Authors

17 Oct 2025 (modified: 30 Oct 2025), Under review for TMLR, CC BY 4.0
Abstract: Multi-label image recognition is a foundational task in computer vision, and vision-language models have recently driven significant progress on it. However, previous approaches mostly use language models in a simplistic manner and do not fully exploit their potential. To address this, we propose a Mamba-enhanced Visual-Linguistic Representation (MVLR) framework for multi-label image recognition that makes fuller use of visual-linguistic representations. MVLR first introduces Prompt-Driven Label Representation learning (PDLR), which combines hard and soft prompts to acquire comprehensive semantic knowledge for all labels from a large language model. Given the extracted label representations, an Interaction and Fusion Model (IFM) lets them interact and then fuses them. Specifically, IFM employs a label attention to explore label co-occurrence relations and a context-aware attention to adaptively aggregate context information into the label representations; it then applies a channel attention to fuse the two features, yielding more reliable and effective label representations. Finally, we propose a Quadruplet Mamba-enhanced Visual-Linguistic block (QMVL), which builds on the Mamba architecture to let visual and linguistic features interact mutually. QMVL emphasizes the features of both modalities simultaneously, in contrast to previous works that treat linguistic information as a secondary supplement. Extensive experiments on several popular datasets for general multi-label recognition, including MS-COCO, Pascal VOC 2007, and NUS-WIDE, demonstrate the superiority of MVLR.
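The IFM data flow described in the abstract (label attention, context-aware attention, then channel-attention fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the scaled dot-product attention, the sigmoid channel gate, the reading of "the two features" as the two attention outputs, and every name and shape below (`attention`, `interaction_and_fusion`, `w_gate`, the random inputs) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    # Plain scaled dot-product attention (an assumed stand-in for the
    # attention modules named in the abstract).
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values


def interaction_and_fusion(labels, context, w_gate):
    # labels:  (num_labels, dim)  label representations, e.g. from prompts
    # context: (num_tokens, dim)  hypothetical context token embeddings
    # w_gate:  (2 * dim, dim)     hypothetical channel-gate weights
    co_occurrence = attention(labels, labels, labels)    # label attention
    context_aware = attention(labels, context, context)  # context-aware attention
    # Channel attention (assumed form): pool both branches over the label
    # axis, then produce one gate value in (0, 1) per channel.
    pooled = np.concatenate([co_occurrence, context_aware], axis=-1).mean(axis=0)
    gate = 1.0 / (1.0 + np.exp(-(pooled @ w_gate)))
    # Per-channel convex combination of the two branches.
    return gate * co_occurrence + (1.0 - gate) * context_aware


rng = np.random.default_rng(0)
labels = rng.normal(size=(5, 8))    # 5 labels, 8 channels
context = rng.normal(size=(12, 8))  # 12 context tokens
w_gate = rng.normal(size=(16, 8)) * 0.1
fused = interaction_and_fusion(labels, context, w_gate)
print(fused.shape)  # (5, 8)
```

Because the gate lies strictly between 0 and 1, each fused channel is a convex combination of the two branches, which is one simple way to read "fuse the two features together"; the paper's actual fusion may differ.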
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Vinay_P_Namboodiri1
Submission Number: 6240