Abstract: The key to effective RGB-T tracking lies in the feature extraction and feature fusion of RGB and thermal infrared (TIR) images. Current RGB-T trackers mainly alternate between intra-modal feature extraction and inter-modal feature fusion. However, this design may confuse the pre-trained model and fail to fully exploit the potential of feature learning. Moreover, current RGB-T trackers are primarily based on CNNs or Transformer networks: CNNs are limited by their receptive fields, while Transformer networks suffer from the quadratic cost of self-attention. To address these issues, we propose a novel RGB-T tracker based on the Transformer-Mamba Trident-Branch (TMTB) architecture. Our tracker consists of an RGB Branch and a TIR Branch, both built on a pre-trained Transformer encoder, together with a Mamba-based Fusion Branch. This design keeps intra-modal feature extraction independent of inter-modal feature fusion: it fully leverages the pre-trained model's template-search interaction within each modality, while the Fusion Branch focuses on fusing inter-modal features from the search region only. We further capitalize on two characteristics of Mamba: its input-dependent (dynamic) parameters for fusing RGB and TIR features, and its linear complexity for modeling long-range dependencies. Our method achieves a balanced trade-off among accuracy, parameter count, and speed on multiple datasets, including LasHeR, RGBT234, and RGBT210. Our findings partially validate the effectiveness of Mamba's characteristics in facilitating multi-modal fusion.
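The trident-branch data flow described above can be sketched as follows. This is a minimal illustration with NumPy placeholders, not the authors' implementation: `transformer_encoder` and `mamba_fusion` are hypothetical stand-ins for the pre-trained encoder and the Mamba block, and the token counts and dimensions are arbitrary. It shows only the structural idea: intra-modal template-search interaction per branch, then a linear-time, input-gated scan over the concatenated search-region tokens alone.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # token dimension (illustrative)

def transformer_encoder(template, search):
    # Placeholder for the pre-trained Transformer encoder: joint
    # template/search interaction within a single modality.
    tokens = np.concatenate([template, search], axis=0)
    return tokens @ (0.1 * rng.standard_normal((D, D)))  # toy mixing

def mamba_fusion(rgb_search, tir_search):
    # Placeholder for the Mamba-based Fusion Branch: an input-dependent
    # ("dynamic") gate drives a recurrent scan that is linear in
    # sequence length, over search-region tokens only.
    seq = np.concatenate([rgb_search, tir_search], axis=0)
    state = np.zeros(D)
    out = []
    for x in seq:                        # linear-time scan
        gate = 1.0 / (1.0 + np.exp(-x))  # dynamic, input-conditioned
        state = gate * state + (1.0 - gate) * x
        out.append(state)
    return np.stack(out)

n_t, n_s = 4, 16  # template / search token counts (illustrative)
rgb_t, rgb_s = rng.standard_normal((n_t, D)), rng.standard_normal((n_s, D))
tir_t, tir_s = rng.standard_normal((n_t, D)), rng.standard_normal((n_s, D))

# Intra-modal extraction stays independent per modality ...
rgb_feat = transformer_encoder(rgb_t, rgb_s)
tir_feat = transformer_encoder(tir_t, tir_s)

# ... and the Fusion Branch sees only the search-region tokens.
fused = mamba_fusion(rgb_feat[n_t:], tir_feat[n_t:])
print(fused.shape)  # (32, 8): 2 * n_s fused search tokens
```

Keeping the fusion input restricted to search tokens mirrors the paper's claim that template-search interaction is already handled inside each pre-trained branch, so the Fusion Branch can specialize in inter-modal mixing.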
External IDs: dblp:conf/pricai/DuZWZH24