Mamba Adapter: Efficient Multi-Modal Fusion for Vision-Language Tracking

Published: 01 Jan 2025, Last Modified: 04 Nov 2025 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: Utilizing the high-level semantic information of language to compensate for the limitations of vision information is a highly regarded approach in single-object tracking. However, most existing vision-language (VL) trackers employ full-parameter fine-tuning, which can easily lead to catastrophic forgetting. As a result, they fail to fully exploit the prior knowledge of pre-trained models from upstream tasks, leading to unsatisfactory tracking performance. To alleviate this problem, we propose a simple yet effective Vision-Language Tracking pipeline based on a Mamba Adapter, named MAVLT, which adopts the idea of parameter-efficient fine-tuning (PEFT) to realize interaction between the vision and language modalities. This approach offers the following advantages: 1) The knowledge of the upstream pre-trained model is efficiently inherited by freezing its parameters, ensuring that the VL tracking framework only learns the modules for vision-language interaction, with a focus on fusion between modalities. 2) The interaction between the language and vision encoders is flexibly bridged at each encoder layer via the proposed Mamba adapter, enabling efficient exchange of visual and language information at multiple levels. Extensive experiments on five popular vision-language tracking benchmarks validate the effectiveness of the proposed MAVLT. In particular, MAVLT achieves a 73.4% AUC score on the LaSOT benchmark while updating only 0.18% (0.32M) of the total parameters. Code and models are available at https://github.com/GXNU-ZhongLab/MAVLT.
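The PEFT wiring the abstract describes — a frozen pre-trained encoder with small trainable adapter modules bridging the modalities at each layer — can be sketched as follows. This is a hedged illustration only: the `MambaAdapter` class, its bottleneck design, and all layer sizes are assumptions for demonstration, and a plain linear bottleneck stands in for the actual Mamba (state-space) block used in MAVLT.

```python
import torch
import torch.nn as nn

class MambaAdapter(nn.Module):
    """Hypothetical bottleneck adapter fusing a language cue into the
    visual stream. The real MAVLT adapter is built on a Mamba block;
    a linear down/up projection is used here only to show the PEFT wiring."""
    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, N_vis, D), lang: (B, N_lang, D)
        # Pool the projected language tokens and add them to every visual token.
        cue = self.up(torch.tanh(self.down(lang))).mean(dim=1, keepdim=True)
        return vis + cue

# Freeze the pre-trained encoder layer; only the adapter stays trainable.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in encoder_layer.parameters():
    p.requires_grad = False

adapter = MambaAdapter(dim=768)

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
total = trainable + sum(p.numel() for p in encoder_layer.parameters())
print(f"trainable share per layer: {trainable / total:.2%}")
```

With these illustrative sizes, the trainable share per layer is well under 1%, mirroring the paper's reported 0.18% (0.32M) of total parameters updated.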