Abstract: RGB-based object tracking is a fundamental task in computer vision, aiming to identify, locate, and continuously track objects of interest across sequential video frames. Despite significant advances in the performance of traditional RGB trackers, they still struggle to maintain accuracy and robustness under complex backgrounds, occlusions, and rapid motion. To tackle these challenges, combining visual auxiliary modalities has gained significant attention. Beyond this, integrating natural language information offers additional advantages by providing high-level semantic context, enhancing robustness, and clarifying target priorities, further elevating tracker performance. This work proposes the Adaptive Multi-modal Visual Tracking with Dynamic Semantic Prompts (AMVTrack) tracker, which efficiently incorporates image descriptions while avoiding text dependency during tracking, improving flexibility and adaptability. AMVTrack significantly reduces computational resource consumption by freezing the parameters of the image encoder, text encoder, and Box Head, optimizing only a small number of learnable prompt parameters. We further introduce the Adaptive Dynamic Semantic Prompt Generator (ADSPG), which dynamically generates semantic prompts from visual features, and the Visual-Language Fusion Adaptation (V-L FA) method, which integrates multi-modal features to ensure consistency and complementarity of information. Additionally, we partition the Image Encoder to investigate in depth how feature importance varies across regions of different depth and width. Experimental results demonstrate that AMVTrack achieves significant performance improvements on multiple benchmark datasets, proving its effectiveness and robustness in complex scenarios.
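The parameter-efficient setup the abstract describes (frozen encoders and Box Head, with only a few learnable prompt parameters optimized) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: a fixed NumPy matrix `W` stands in for the frozen image encoder, and a single vector `prompt` stands in for the learnable prompt parameters; the real AMVTrack architecture is far larger and trained with backpropagation through transformer encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "image encoder": a fixed linear map standing in for a
# pretrained backbone whose parameters are never updated.
W = rng.normal(size=(8, 8)) * 0.1

# Learnable prompt: the ONLY trainable parameters in this sketch.
prompt = np.zeros(8)

x = rng.normal(size=8)        # stand-in visual feature
target = rng.normal(size=8)   # stand-in training target

def forward(p):
    # Encoder applied to the prompted input (prompt added to the feature).
    return W @ (x + p)

def loss(p):
    d = forward(p) - target
    return 0.5 * float(d @ d)

loss_before = loss(prompt)
W_before = W.copy()

# Gradient descent on the prompt alone: dL/dp = W^T (W(x+p) - target).
# The frozen encoder W receives no updates.
for _ in range(200):
    grad = W.T @ (forward(prompt) - target)
    prompt -= 0.5 * grad

loss_after = loss(prompt)
```

After training, `W` is bitwise unchanged while the loss has dropped, mirroring the abstract's claim that tuning a small set of prompt parameters suffices once the heavy encoders are frozen.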
External IDs:dblp:journals/tmm/WangLJWLLCLMW26