Abstract: RGBT tracking aims to fully exploit the complementary advantages of the visible and infrared modalities to achieve robust tracking, so the design of the multimodal fusion network is crucial. However, existing methods typically build the fusion network on CNNs or Transformers, which makes it difficult to balance performance and efficiency. To overcome this issue, we introduce an innovative visual state space (VSS) model, represented by Mamba, for RGBT tracking. In particular, we design a novel multi-path Mamba fusion network that achieves robust multimodal fusion while maintaining linear computational overhead. First, we design a multi-path Mamba layer that thoroughly fuses the two modalities from both global and local perspectives. Second, to alleviate the inadequate modeling of VSS in the channel dimension, we introduce a simple yet effective channel swapping layer. Extensive experiments on four public RGBT tracking datasets demonstrate that our method surpasses existing state-of-the-art trackers. Notably, our fusion method achieves higher tracking performance than the well-known Transformer-based fusion approach TBSI, while reducing parameter count by 92.8% and computational cost by 80.5%.
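To make the channel swapping idea concrete, below is a minimal PyTorch sketch of one plausible realization: exchanging a fixed fraction of channels between the RGB and thermal feature maps so that each modality's stream carries cross-modal information along the channel dimension. The abstract does not specify the exact mechanism; the half-split scheme, the `ratio` parameter, and the function name are assumptions for illustration only.

```python
import torch

def channel_swap(f_rgb: torch.Tensor, f_tir: torch.Tensor, ratio: float = 0.5):
    """Exchange a fraction of channels between RGB and thermal feature maps.

    Both inputs are (B, C, H, W) tensors from the two modality branches.
    NOTE: the 0.5 swap ratio and leading-channel split are illustrative
    assumptions, not details taken from the paper.
    """
    k = int(f_rgb.size(1) * ratio)  # number of channels to exchange
    # Each output stream keeps its own trailing channels but receives the
    # other modality's leading channels, injecting cross-modal cues
    # directly along the channel dimension.
    out_rgb = torch.cat([f_tir[:, :k], f_rgb[:, k:]], dim=1)
    out_tir = torch.cat([f_rgb[:, :k], f_tir[:, k:]], dim=1)
    return out_rgb, out_tir
```

Because the operation is a pure rearrangement, it adds no parameters and negligible computation, which is consistent with the abstract's emphasis on keeping the fusion network lightweight.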