SMART: Semantic-Aware Masked Attention Relational Transformer for Multi-label Image Recognition

Published: 01 Jan 2022, Last Modified: 25 Jul 2025 · IEEE Signal Processing Letters, 2022 · License: CC BY-SA 4.0
Abstract: As objects usually co-exist in an image, learning label co-occurrence is a compelling approach to improving the performance of multi-label image recognition. However, dependencies involving categories that are absent from an image cannot be evaluated effectively; these redundant label dependencies may introduce noise and further degrade classification performance. We therefore propose SMART, a Semantic-aware Masked Attention Relational Transformer for multi-label image recognition. In addition to leveraging a Transformer to model inter-class dependencies, the proposed masked attention filters out the redundant dependencies among absent categories. SMART explicitly and accurately captures label dependencies without requiring extra word embeddings. Our method achieves new state-of-the-art results on two multi-label image recognition benchmarks, MS-COCO 2014 and NUS-WIDE. Extensive ablation studies and empirical analysis further demonstrate the effectiveness of the essential components of our method under different factors.
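The core idea, attention over label representations where absent categories are masked out before the softmax, can be sketched as follows. This is a minimal illustration of masked scaled dot-product attention, not the paper's actual implementation; the function name, shapes, and the `present` indicator are illustrative assumptions.

```python
import numpy as np

def masked_label_attention(queries, keys, values, present):
    """Scaled dot-product attention over per-label features.

    Columns corresponding to labels absent from the image (present=False)
    are suppressed before the softmax, so no attention mass flows to them.
    A sketch of the masked-attention idea; not the authors' exact code.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)              # (L, L) label-to-label scores
    scores = np.where(present[None, :], scores, -1e9)   # mask out absent labels
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ values, weights

# Toy example: 4 candidate labels, only labels 0 and 2 present in the image.
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))
present = np.array([True, False, True, False])
out, w = masked_label_attention(q, k, v, present)
```

With the mask applied, each label attends only to labels predicted present, so co-occurrence is modeled without the noise of spurious dependencies on absent categories.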