Abstract: Multi-modal tracking based on the fusion of RGB, depth, and thermal infrared data aims to advance object tracking by leveraging the complementary strengths of these three modalities. In this paper, we propose VPLMMT, an end-to-end multi-modal object tracker built on a large visual prompt model, to achieve robust and accurate tracking in complex environments. To cope with the limited amount of three-modal training data, we adopt a prompt-tuning paradigm to fine-tune a large foundation model pre-trained on RGB data. This approach not only exploits the pre-trained feature representations but also substantially reduces training cost and memory consumption. Specifically, depth maps and thermal infrared images serve as visual prompts for the RGB images, allowing the original foundation model to adapt to the multi-modal task. We evaluate the framework on the multi-modal (RGB, depth, and thermal infrared) video dataset released for the competition, where the proposed method ranks 3rd in overall tracking performance on the competition track.
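
The abstract does not specify the concrete fusion mechanism, so the following is only a minimal PyTorch sketch of the general visual-prompt-tuning idea it describes: the RGB backbone stays frozen, and a small trainable prompt branch maps depth and thermal infrared inputs into prompt features that are injected into the RGB token stream. The module names (`VisualPromptFusion`, `PromptTunedTracker`) and all layer sizes are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class VisualPromptFusion(nn.Module):
    """Hypothetical lightweight prompt branch: encodes depth and thermal
    maps into prompt features added to the RGB patch embeddings.
    Only this branch is trained; the RGB backbone stays frozen."""
    def __init__(self, embed_dim=256, patch=16):
        super().__init__()
        # One shallow patch embedding per auxiliary modality (1-channel maps).
        self.depth_embed = nn.Conv2d(1, embed_dim, kernel_size=patch, stride=patch)
        self.tir_embed = nn.Conv2d(1, embed_dim, kernel_size=patch, stride=patch)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, depth, tir):
        d = self.depth_embed(depth).flatten(2).transpose(1, 2)  # (B, N, C)
        t = self.tir_embed(tir).flatten(2).transpose(1, 2)      # (B, N, C)
        return self.fuse(torch.cat([d, t], dim=-1))             # prompt tokens

class PromptTunedTracker(nn.Module):
    """Sketch of prompt tuning: a pre-trained RGB transformer is frozen,
    and multi-modal prompts are added to its input tokens."""
    def __init__(self, rgb_backbone, embed_dim=256, patch=16):
        super().__init__()
        # In a real setup this patch embedding would come from the
        # pre-trained model; here it is treated as part of the frozen stem.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.backbone = rgb_backbone
        self.prompt = VisualPromptFusion(embed_dim, patch)
        # Freeze the pre-trained parts; only the prompt branch learns.
        for p in self.backbone.parameters():
            p.requires_grad = False
        for p in self.patch_embed.parameters():
            p.requires_grad = False

    def forward(self, rgb, depth, tir):
        tokens = self.patch_embed(rgb).flatten(2).transpose(1, 2)
        tokens = tokens + self.prompt(depth, tir)  # inject multi-modal prompts
        return self.backbone(tokens)               # frozen feature extractor

# Usage with a stand-in frozen backbone (a tiny transformer encoder):
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2)
tracker = PromptTunedTracker(backbone)
rgb = torch.randn(2, 3, 128, 128)
depth = torch.randn(2, 1, 128, 128)
tir = torch.randn(2, 1, 128, 128)
print(tracker(rgb, depth, tir).shape)  # torch.Size([2, 64, 256])
```

Because gradients flow only through the prompt branch, the number of trainable parameters stays small relative to the backbone, which is what makes this paradigm attractive when three-modal training data is scarce.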