RGBT tracking via frequency-aware feature enhancement and unidirectional mixed attention

Published: 01 Jan 2025, Last Modified: 05 Jun 2025 · Neurocomputing 2025 · CC BY-SA 4.0
Abstract: RGBT object tracking is widely used due to the complementary nature of the RGB and TIR modalities. However, Transformer- and CNN-based RGBT trackers face significant challenges in effectively enhancing and extracting features from one modality and fusing them into the other. To achieve effective regional feature representation and adequate information fusion, we propose a novel tracking method that employs frequency-aware feature enhancement and bidirectional multistage feature fusion. First, we propose an Early Region Feature Enhancement (ERFE) module, which comprises a Frequency-aware Self-region Feature Enhancement (FSFE) block and a Cross-attention Cross-region Feature Enhancement (CCFE) block. The FFT-based FSFE block enhances the features of the template and search regions separately, while the CCFE block improves feature representation by considering the template and search regions jointly. Second, we propose a Bidirectional Multistage Feature Fusion (BMFF) module, whose core component is the Complementary Feature Extraction Attention (CFEA) module. The CFEA module, consisting of a Unidirectional Mixed Attention (UMA) block and a Context Focused Attention (CFA) block, extracts information from one modality: when RGB is the primary modality, TIR serves as the auxiliary modality, and vice versa. The auxiliary-modality features processed by CFEA are added to the primary-modality features, and this fusion process is carried out bidirectionally over multiple stages. Third, extensive experiments on three benchmark datasets, RGBT234, LaSHeR, and GTOT, demonstrate that our tracker outperforms advanced RGBT tracking methods.
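The abstract does not give the internals of the FFT-based FSFE block, but the general idea of frequency-aware enhancement can be illustrated with a minimal sketch: transform a region's feature map to the frequency domain, reweight its frequency components, and transform back. The function name `fsfe_enhance`, the fixed high-frequency gain, and the parameter `alpha` are all illustrative assumptions, not the paper's actual design (which would typically use learnable filters).

```python
import numpy as np

def fsfe_enhance(x, alpha=0.5):
    """Illustrative sketch of FFT-based self-region feature enhancement.

    x: (H, W, C) feature map of the template or search region.
    A fixed radial high-frequency boost stands in for the learnable
    frequency filtering a real FSFE block would apply.
    """
    X = np.fft.fft2(x, axes=(0, 1))              # spatial -> frequency domain
    H, W = x.shape[:2]
    fy = np.fft.fftfreq(H)[:, None]              # vertical frequencies
    fx = np.fft.fftfreq(W)[None, :]              # horizontal frequencies
    radius = np.sqrt(fy**2 + fx**2)[..., None]   # radial frequency per bin
    gain = 1.0 + alpha * radius / radius.max()   # emphasize high frequencies
    return np.fft.ifft2(X * gain, axes=(0, 1)).real  # back to spatial domain

feat = np.random.rand(16, 16, 8).astype(np.float32)
enhanced = fsfe_enhance(feat)
print(enhanced.shape)  # (16, 16, 8)
```

Because the gain is applied only in the frequency domain, the output keeps the input's spatial and channel shape, so the block can be dropped between existing layers.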
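The fusion direction described in the abstract (auxiliary-modality features attended to and added onto the primary modality, then the roles swapped) can be sketched with plain cross-attention. This is a hedged illustration under assumed token shapes; the names `uma_fuse`, `softmax`, and the single-head, unscaled-projection form are simplifications, not the paper's UMA/CFA implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def uma_fuse(primary, auxiliary):
    """One direction of the bidirectional fusion: primary-modality tokens
    query auxiliary-modality tokens, and the attended auxiliary features
    are added residually onto the primary features."""
    d = primary.shape[-1]
    attn = softmax(primary @ auxiliary.T * d ** -0.5)  # (Np, Na) weights
    return primary + attn @ auxiliary                  # residual addition

rgb = np.random.rand(64, 32)        # RGB-modality tokens (assumed shape)
tir = np.random.rand(64, 32)        # TIR-modality tokens (assumed shape)
fused_rgb = uma_fuse(rgb, tir)      # RGB primary, TIR auxiliary
fused_tir = uma_fuse(tir, rgb)      # and vice versa
print(fused_rgb.shape)  # (64, 32)
```

Running both calls, as above, mirrors the bidirectional aspect: each modality in turn acts as the primary one, and repeating this at several backbone stages would make the fusion multistage.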