Abstract: The goal of RGB-T tracking is to enhance tracking accuracy and robustness by leveraging the complementary features of the RGB and TIR modalities in complex scenarios. Previous methods have overlooked the power of semantic features for extracting valuable information from the different modalities and improving the interactions across them. Moreover, using Bounding Boxes (BBox) for target initialization can cause issues such as bounding box blurring and tracking drift when the target's appearance changes or it becomes occluded. To address these challenges, we propose a CLIP-based RGB-T tracking algorithm, TIETracker, which aims to exploit the complementary advantages of the two modalities more effectively by using textual information. Textual descriptions guide the backbone network to learn target representations in both modalities and facilitate the interaction of multi-modal features. Additionally, when occlusion or scale changes cause target features to be missing or altered, textual information adaptively supplements the target representation. This also strengthens the response in the target's image region, mitigating bounding-box inaccuracy and tracking drift. Our extensive evaluation on three leading RGB-T tracking benchmarks demonstrates that TIETracker achieves competitive performance compared to state-of-the-art methods, effectively countering feature loss caused by changes in target appearance and occlusion.
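Since the abstract only sketches the idea, the following is a minimal illustrative sketch, not the authors' implementation, of how CLIP text embeddings could condition RGB and TIR feature tokens and mediate cross-modal interaction via attention. All module and parameter names (e.g., `TextGuidedFusion`, `dim=512`) are assumptions for illustration only.

```python
# Hypothetical sketch of text-guided multi-modal fusion; NOT the TIETracker code.
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Cross-attention: visual tokens of each modality attend to the text embedding.
        self.text_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_tir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-modal interaction between the text-conditioned streams.
        self.cross_modal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, tir_tokens, text_emb):
        # rgb_tokens, tir_tokens: (B, N, dim); text_emb: (B, T, dim) from a CLIP text encoder.
        # Condition each modality on the textual description of the target.
        rgb = rgb_tokens + self.text_to_rgb(rgb_tokens, text_emb, text_emb)[0]
        tir = tir_tokens + self.text_to_tir(tir_tokens, text_emb, text_emb)[0]
        # Let the RGB stream query the TIR stream so complementary cues flow across modalities.
        fused = rgb + self.cross_modal(rgb, tir, tir)[0]
        return self.norm(fused)

# Usage with dummy shapes matching CLIP's 512-d embedding space:
if __name__ == "__main__":
    B, N, T, D = 2, 196, 1, 512
    fusion = TextGuidedFusion(dim=D)
    out = fusion(torch.randn(B, N, D), torch.randn(B, N, D), torch.randn(B, T, D))
    print(out.shape)  # torch.Size([2, 196, 512])
```

A symmetric TIR-queries-RGB branch would be the natural extension; the sketch keeps one direction for brevity.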