Mining representative tokens via transformer-based multi-modal interaction for RGB-T tracking

Pujian Lai, Dong Gao, Shilei Wang, Gong Cheng

Published: 18 Jul 2025, Last Modified: 27 Jan 2026 | Pattern Recognition | Everyone | CC BY 4.0

Abstract: RGB-T tracking leverages the complementarity of visible and thermal modalities for robust performance in challenging environments. However, previous RGB-T trackers are vulnerable to interference from irrelevant backgrounds and ignore the modality gap. To address these issues, we propose MRTTrack, a Transformer-based RGB-T tracking framework consisting of a multi-modal separate-then-collaborative (MSC) module and a cross-modal discrepancy constraint (CDC). Specifically, the MSC is designed to mitigate irrelevant background interference and operates in two stages: target-oriented token selection and multi-modal token interaction. By recursively aggregating attention maps across layers, the target-oriented token selection produces an index mask for representative tokens, which is then used to guide multi-modal token interaction via mask-based attention. Additionally, CDC enforces consistency across modalities
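The two MSC stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the aggregation rule, so the code assumes an attention-rollout-style product of per-layer maps with an identity term for residual connections, scores search tokens by the attention they receive from template tokens, and keeps the top `keep` of them. All function names (`aggregate_attention`, `select_tokens`, `masked_attention`) and parameters are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def aggregate_attention(attn_layers):
    """Recursively combine per-layer attention maps (rollout-style assumption).

    Each map is an n x n row-stochastic matrix. Half of the identity is mixed
    in to mimic residual connections, then layers are chained by multiplication.
    """
    n = len(attn_layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in attn_layers:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


def select_tokens(rollout, template_idx, search_idx, keep):
    """Build a 0/1 index mask marking the `keep` highest-scoring search tokens.

    The score of a search token is the aggregated attention it receives from
    the template (target) tokens; this scoring rule is an assumption.
    """
    score = {j: sum(rollout[i][j] for i in template_idx) for j in search_idx}
    top = sorted(search_idx, key=lambda j: score[j], reverse=True)[:keep]
    mask = [0] * len(rollout)
    for j in top:
        mask[j] = 1
    return mask


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted to keys whose mask entry is 1.

    Queries may come from one modality and keys/values from the other.
    Requires at least one kept key.
    """
    kept = [j for j, m in enumerate(mask) if m]
    dim = len(queries[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim) for j in kept]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        out.append([sum(w / total * values[j][c] for w, j in zip(weights, kept))
                    for c in range(len(values[0]))])
    return out


# Toy example: token 0 is the template, tokens 1-3 are search tokens.
layer = [[0.10, 0.10, 0.70, 0.10],
         [0.25, 0.25, 0.25, 0.25],
         [0.25, 0.25, 0.25, 0.25],
         [0.25, 0.25, 0.25, 0.25]]
rollout = aggregate_attention([layer, layer])
mask = select_tokens(rollout, template_idx=[0], search_idx=[1, 2, 3], keep=1)

rgb_queries = [[1.0, 0.0]]
thermal_keys = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]
thermal_values = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
fused = masked_attention(rgb_queries, thermal_keys, thermal_values, mask)
```

In this toy setup the template attends most strongly to token 2, so only that token survives the mask and the RGB query reads its thermal value alone. A real tracker would operate on batched tensors and multiple heads; the list-based form here only makes the masking logic explicit.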