Rethinking Attention Mechanism: Channel Re-attention and Spatial Multi-region Attention for Fine-grained Visual Classification

Published: 01 Jan 2025, Last Modified: 23 Jul 2025 · Neural Process. Lett. 2025 · CC BY-SA 4.0
Abstract: Fine-grained visual classification (FGVC) aims to distinguish sub-categories within a broader class, such as different species of birds or different brands of cars. Learning feature representations from the discriminative parts of an object has always played an essential role in this task, and applying attention mechanisms to extract such parts has recently become a trend. However, classical attention mechanisms bring two main limitations to FGVC. First, they focus on the informative channels of feature maps while ignoring information-poor channels, which also contain fine-grained knowledge helpful for classification. Second, they largely attend to the most salient parts of objects while ignoring inconspicuous yet discriminative parts. To address these limitations, we propose channel re-attention and spatial multi-region attention for fine-grained visual classification (CRA-SMRA), which incorporates two lightweight modules that can be easily inserted into existing convolutional neural networks (CNNs). On the one hand, we provide a channel re-attention module (CRAM), which re-weights the channels of the current stage's feature map, producing more discriminative features and enabling the network to mine useful fine-grained knowledge from information-poor channels. On the other hand, a spatial multi-region attention module (SMRAM) is proposed to compute the spatial matching degree between feature maps at different stages, yielding multi-stage feature maps that focus on different discriminative parts. Our method requires no bounding boxes or part annotations and can be trained end-to-end. Extensive experiments on several fine-grained benchmark datasets demonstrate that our approach achieves state-of-the-art performance.
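The abstract does not give the exact formulation of CRAM, but the core idea it states — re-weighting channels so that information-poor channels are not discarded — can be illustrated with a minimal, hypothetical sketch. Everything below (the pooling-plus-softmax scoring, the `alpha` blending factor, and the function names) is an assumption for illustration only, not the paper's actual module:

```python
import math

def global_avg_pool(fmap):
    # fmap: list of C channels, each an H x W grid (list of lists).
    # Returns one scalar descriptor per channel (mean activation).
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def softmax(xs):
    # Numerically stable softmax over the channel descriptors.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def channel_reattention(fmap, alpha=0.5):
    # Hypothetical sketch of channel re-attention: a plain channel-attention
    # module would scale channel i by its weight w[i] alone, suppressing
    # low-scoring (information-poor) channels. Here each channel instead gets
    # w[i] + alpha * (1 - w[i]), so every channel keeps at least a fraction
    # alpha of its signal and poor channels remain available for mining.
    w = softmax(global_avg_pool(fmap))
    out = []
    for ch, wi in zip(fmap, w):
        scale = wi + alpha * (1.0 - wi)
        out.append([[v * scale for v in row] for row in ch])
    return out
```

With `alpha = 0`, this degenerates to ordinary softmax channel attention; a positive `alpha` is one simple way to realize the "re-attend the weak channels" behavior the abstract describes.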