Attention learning with counterfactual intervention based on feature fusion for fine-grained feature learning

Published: 01 Jan 2025 · Last Modified: 06 May 2025 · Digit. Signal Process. 2025 · CC BY-SA 4.0
Abstract: Deep learning models can learn features from large amounts of data and usually localize the overall region of a target object accurately in visual recognition tasks. However, in fine-grained scenarios with high inter-class similarity, such as brand recognition for vehicles and subspecies recognition for organisms, the model must capture crucial discriminative features and provide reliable explanations when its decision behavior is traced. Therefore, this paper builds on the idea of counterfactual intervention from causal reasoning and proposes counterfactual intervention for attention learning, which learns the feature information that plays an important role in fine-grained recognition tasks. First, we use an iterative feature fusion attention module that learns features at different levels and fuses them to capture the crucial features of the target object while suppressing attention to unimportant ones. Second, we perform a counterfactual intervention on the fusion-based attention map; the changes produced by the intervening variables serve as supervisory signals for attention learning, strengthening the features that contribute positively to the predicted result. In addition, we use a contrastive learning loss as a constraint to avoid focusing solely on salient features, enabling the network to learn richer discriminative features. Finally, we use Grad-CAM visualization to explain the decision-making process. The experimental results show that the proposed method learns important distinguishable features of the target object, weakens attention to non-critical regions, and offers reliable traceability when analyzing decision-making behavior.
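The core idea of the counterfactual intervention described above can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration (all names and shapes are assumptions, not the paper's implementation): the "effect" of a learned attention map is measured as the gap between the prediction it produces and the prediction under a random counterfactual attention map, and that gap can serve as a supervisory signal for attention learning.

```python
# Hypothetical sketch of counterfactual attention intervention.
# The supervisory signal is the gap between the prediction under the
# learned attention map and the prediction under a random one.
import numpy as np

rng = np.random.default_rng(0)

def classify(features, attention, weights):
    """Attention-weighted pooling followed by a linear classifier + softmax."""
    pooled = (features * attention[:, None]).sum(axis=0)   # (d,)
    logits = pooled @ weights                              # (num_classes,)
    e = np.exp(logits - logits.max())                      # stable softmax
    return e / e.sum()

# Toy inputs: 8 spatial locations, 16-dim features, 4 classes.
features = rng.standard_normal((8, 16))
weights = rng.standard_normal((16, 4))

# Learned attention (here random for illustration) vs. a counterfactual
# attention map; both normalized to sum to 1.
attn = rng.random(8); attn /= attn.sum()
cf_attn = rng.random(8); cf_attn /= cf_attn.sum()

p_factual = classify(features, attn, weights)
p_counter = classify(features, cf_attn, weights)

# Effect of the attention on the target class: maximizing this gap
# during training would push attention toward regions that genuinely
# drive the prediction, rather than spurious salient areas.
target = 2
effect = p_factual[target] - p_counter[target]
```

In a full training loop this effect would be turned into a loss (e.g. encouraging a large positive gap for the ground-truth class) and backpropagated into the attention module.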