Simplifying Cross-modal Interaction via Modality-Shared Features for RGBT Tracking

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Thermal infrared (TIR) data exhibits higher tolerance to extreme environments, making it a valuable complement to RGB data in tracking tasks. RGB-T tracking aims to leverage information from both RGB and TIR images for stable and robust tracking. However, existing RGB-T tracking methods often face challenges due to significant modality differences and selective emphasis on interactive information, leading to inefficiencies in the cross-modal interaction process. To address these issues, we propose a novel Integrating Interaction into Modality-shared Features with ViT (IIMF) framework, a simplified cross-modal interaction network comprising modality-shared, RGB modality-specific, and TIR modality-specific branches. The modality-shared branch aggregates modality-shared information and implements inter-modal interaction with the Vision Transformer (ViT). Specifically, our approach first extracts modality-shared features from RGB and TIR features using a cross-attention mechanism. Furthermore, we design a Cross-Attention-based Modality-shared Information Aggregation (CAMIA) module to further aggregate modality-shared information with modality-shared tokens.
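The abstract's core operation, extracting modality-shared features from RGB and TIR token features via cross-attention, can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric averaging of the two attention directions, the absence of learned Q/K/V projections, and all names (`cross_attention`, `rgb`, `tir`) are simplifying assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, kv_feats):
    # Queries from one modality attend to keys/values from the other,
    # so the output keeps only information supported by both modalities.
    d = query_feats.shape[-1]
    scores = query_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores) @ kv_feats

# Hypothetical token features, shape (num_tokens, dim); learned ViT
# projections are omitted for brevity.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(4, 8))
tir = rng.normal(size=(4, 8))

# One plausible way to form modality-shared features: run cross-attention
# in both directions and average the results.
shared = 0.5 * (cross_attention(rgb, tir) + cross_attention(tir, rgb))
print(shared.shape)  # (4, 8)
```

In the full IIMF framework these shared features would additionally be aggregated by the CAMIA module with learnable modality-shared tokens; that step is omitted here.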
Primary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our work proposes a novel method to aggregate modality-shared information and bridge the interaction between modality-shared and modality-specific features for stable and robust RGB-T object tracking, offering a new perspective on multimodal data fusion. It is a high-performance framework for RGB-T tracking and achieves state-of-the-art performance.
Submission Number: 4719