Abstract: Long-range dependency plays a critical role in extracting intricate image features, particularly in image recognition tasks. Previous studies have demonstrated the significance of long-range positional dependencies in both image classification and image segmentation. Building on this, we introduce a Multi-Head Cross Attention module, namely MHCA, along with four different operators, designed to capture and integrate contextual information at every pixel position within feature maps, spanning both the horizontal and vertical directions in a parallel fashion, thus transferring information and sharing weights across multiple heads of features. Moreover, by stacking our module twice to form an \({\rm{MHC}}{{\rm{A}}}^2\) layer, the full context of each pixel in the feature map can be captured, with a lighter computational burden than general fully connected or Non-local networks, and the layer is designed to be seamlessly plugged into existing network architectures. By replacing a specific convolution layer in a convolutional network with an \({\rm{MHC}}{{\rm{A}}}^2\) layer, we construct the MHCA network. Through extensive experiments on various datasets, we demonstrate the validity of our approach. Furthermore, comparative analysis with similar methodologies highlights the superior performance of our method.
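To make the horizontal-and-vertical attention idea concrete, the following is a minimal NumPy sketch of cross attention in which each pixel attends to all positions in its own row and column. This is an illustrative simplification under assumed details: the function name `cross_attention`, single-head operation, and dot-product scoring are assumptions for illustration, not the paper's exact MHCA formulation with its four operators.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Cross (criss-cross style) attention over a feature map.

    q, k, v: arrays of shape (H, W, C). For each pixel (i, j), the
    query attends to the keys/values lying in row i and column j,
    so one pass aggregates horizontal and vertical context; stacking
    the operation twice propagates context from every position.
    (Hypothetical single-head sketch, not the paper's MHCA module.)
    """
    H, W, C = q.shape
    out = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            # Gather keys/values along the same row and the same column.
            ks = np.concatenate([k[i, :, :], k[:, j, :]], axis=0)  # (H+W, C)
            vs = np.concatenate([v[i, :, :], v[:, j, :]], axis=0)  # (H+W, C)
            # Scaled dot-product attention weights over the cross.
            attn = softmax(ks @ q[i, j] / np.sqrt(C))
            out[i, j] = attn @ vs
    return out
```

Each pixel thus attends to H + W positions instead of H * W, which is the source of the lighter computational burden compared with a full Non-local layer; applying the operation twice lets information from any pixel reach any other.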
DOI: 10.1145/3700906.3700965