Abstract: Most RGB-T trackers rely heavily on bottom-up attention and thus overlook top-down cross-modal guidance for learning target features. Consequently, the discriminative power of the learned target features is weak. To address this issue, we propose a novel RGB-T tracker (called TGTrack) that designs a Top-down Cross-modal Guidance mechanism to learn target features in two stages. In the first stage, our TGTrack generates top-down cross-modal guidance signals with multi-modal encoders-decoders and prior vectors. In the second stage, these signals are transmitted and integrated by the attention layers of the cross-modal encoders to improve the discriminative power of the target features. Moreover, we introduce an Attention-Driven Spatio-Temporal Updater for updating discriminative target features. Through cross-frame attention guidance, it effectively eliminates irrelevant features within the search region. As a result, our TGTrack avoids complex multi-modal fusion modules and thus achieves robust RGB-T tracking. Extensive experiments on three popular RGB-T tracking benchmarks (i.e., LasHeR, RGBT234, and RGBT210) demonstrate that our TGTrack achieves new state-of-the-art performance.