Convolutional Attention Fusion for RGBT Tracking

Published: 01 Jan 2025, Last Modified: 26 Jul 2025 · IEEE Signal Process. Lett. 2025 · CC BY-SA 4.0
Abstract: RGBT target tracking accomplishes the tracking task by fusing visible and thermal infrared information. The development of Convolutional Neural Networks (CNNs) and Transformers has greatly advanced this field. However, most existing transformer-based trackers focus on global modeling while neglecting local information. In this paper, we propose a novel Convolutional Attention Fusion Module (CAFM) for RGBT target tracking. Specifically, this module slides local windows across the image like a convolution and captures context features within each window via attention. Additionally, a local position embedding is added within each window and works in conjunction with the global position embedding to enhance the model's understanding of spatial information. Our CAFM thus strengthens local feature extraction by restricting the attention area, and promotes multimodal fusion through cross-attention. We extend OSTrack to RGBT tracking and integrate the proposed CAFM into it. Experimental results show that our method performs well on the LasHeR, RGBT210, and RGBT234 datasets and outperforms other state-of-the-art trackers.
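The mechanism the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): a window slides over two aligned feature maps like a convolution kernel; inside each window, cross-attention takes queries from the RGB modality and keys/values from the thermal modality, with a local position embedding added to the window tokens. All names, shapes, and the window/stride choice are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_attention(rgb, tir, win=4, stride=4, seed=0):
    """Sliding-window cross-attention sketch (hypothetical, not the paper's code).

    rgb, tir : (H, W, C) aligned feature maps from the two modalities.
    Queries come from the RGB window, keys/values from the thermal window,
    and a shared local position embedding is added to both token sets.
    Returns a fused (H, W, C) feature map.
    """
    H, W, C = rgb.shape
    rng = np.random.default_rng(seed)
    # Local position embedding for the win*win tokens of one window
    # (a learnable parameter in a real model; random here for illustration).
    local_pos = rng.standard_normal((win * win, C)) * 0.02
    out = np.zeros_like(rgb)
    count = np.zeros((H, W, 1))
    for i in range(0, H - win + 1, stride):       # slide the window like a conv
        for j in range(0, W - win + 1, stride):
            q = rgb[i:i+win, j:j+win].reshape(-1, C) + local_pos
            kv = tir[i:i+win, j:j+win].reshape(-1, C) + local_pos
            attn = softmax(q @ kv.T / np.sqrt(C))  # attention restricted to the window
            out[i:i+win, j:j+win] += (attn @ kv).reshape(win, win, C)
            count[i:i+win, j:j+win] += 1
    return out / np.maximum(count, 1)             # average overlapping windows

rgb = np.random.default_rng(1).standard_normal((8, 8, 16))
tir = np.random.default_rng(2).standard_normal((8, 8, 16))
fused = window_cross_attention(rgb, tir)
```

With `stride < win` the windows overlap and the averaging step blends neighboring windows, mimicking the "continuously sliding" behavior the abstract describes; with `stride == win` it degenerates to non-overlapping window attention.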