AMITA: Attribute-Guided Masked Image-Text Alignment for Multi-Label Image Representation

Jinyi Fang, Bingke Zhu, Jingling Yuan, Yingying Chen, Ming Tang, Jinqiao Wang

Published: 01 Nov 2025, Last Modified: 09 Nov 2025. IEEE Transactions on Circuits and Systems for Video Technology. CC BY-SA 4.0

Abstract: Multi-label image classification, which involves recognizing multiple objects within a single image, is a fundamental task in computer vision. Recently, Visual-Language Models (VLMs) have made remarkable progress in this area, and many approaches combine the textual and visual modalities to understand the entire image. In this paper, we find that multi-label classification accuracy correlates directly with how accurately objects are localized. However, previous methods did not specifically address localization accuracy, which leads to sub-optimal classification accuracy. We therefore propose AMITA, Attribute-guided Masked Image-Text Alignment for multi-label image representation. AMITA improves localization accuracy by segmenting object masks, thereby enhancing multi-label classification accuracy. Additionally, AMITA introduces an AutoFocus method to handle the localization of small objects: AutoFocus runs recognition on both resized and cropped versions of the image and automatically selects the images that are useful for the classification target. Moreover, AMITA incorporates Attribute-guided Prompting to strengthen the semantic distinction among categories: it uses large language models to obtain the attributes of each category and carefully designs prompts that enhance the attribute differences among categories. Finally, extensive experiments on three popular datasets, MS-COCO, Pascal VOC 2007, and NUS-WIDE, demonstrate the superiority of AMITA.
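
The abstract gives no implementation details, so the following is only a rough illustration of the two ideas it names, not the authors' method. It assumes that Attribute-guided Prompting can be approximated by appending LLM-provided attributes to a category prompt, and that AutoFocus can be approximated by scoring the full image plus a grid of crops and keeping, per class, the view with the highest score. The prompt template, the grid layout, the per-class argmax rule, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def build_attribute_prompt(category: str, attributes: Sequence[str]) -> str:
    """Compose a text prompt from a category name and its attributes.

    Hypothetical template: the paper's actual prompts are not given
    in the abstract.
    """
    if not attributes:
        return f"a photo of a {category}"
    return f"a photo of a {category}, which has " + ", ".join(attributes)


def grid_crops(width: int, height: int, rows: int = 2, cols: int = 2) -> List[Box]:
    """Return the full-image box followed by rows x cols non-overlapping crops."""
    boxes: List[Box] = [(0, 0, width, height)]
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def select_best_views(
    view_scores: Sequence[Dict[str, float]],
) -> Dict[str, Tuple[int, float]]:
    """For each class, keep the index and score of the best-scoring view.

    Assumption: "selecting useful images" is approximated here by a
    per-class argmax over views; ties keep the earliest view.
    """
    best: Dict[str, Tuple[int, float]] = {}
    for idx, scores in enumerate(view_scores):
        for label, score in scores.items():
            if label not in best or score > best[label][1]:
                best[label] = (idx, score)
    return best


# Example usage with made-up attributes and scores.
prompt = build_attribute_prompt("zebra", ["black and white stripes", "four legs"])
boxes = grid_crops(640, 480)
best = select_best_views([
    {"person": 0.9, "ball": 0.2},  # view 0: full image
    {"person": 0.3, "ball": 0.7},  # view 1: a crop where the small object is larger
])
```

In this toy example the small object ("ball") is assigned to the cropped view, which mirrors the motivation the abstract gives for AutoFocus. In a real pipeline the per-view scores would come from a VLM's image-text similarity rather than hand-written numbers.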

External IDs: doi:10.1109/tcsvt.2025.3577277