Keywords: Language-guided object tracking; Multimodal Large Language Models; Text Refinement; First-frame localization
Abstract: Language-guided object tracking aims to locate the target in a video based solely on a natural language description, without any bounding box supervision. While recent methods have made encouraging progress by incorporating language into visual tracking, most treat it as an auxiliary signal rather than a primary driver. This limits their effectiveness in fully language-only scenarios, which remain underexplored despite their user-friendly nature. In this paper, we propose MAGTrack, a novel framework for language-guided object tracking that seamlessly integrates Multimodal Large Language Models (MLLMs) without requiring additional training. MAGTrack tackles key challenges through two plug-and-play modules: the MLLM-based Grounding Module (MGM) and the MLLM-based Text Refinement Module (TRM). MGM leverages MLLM reasoning to achieve accurate initial target localization, even in challenging scenarios with visually similar objects. Complementarily, TRM dynamically updates the textual description based on the current visual context and tracking history. Extensive experiments on four benchmarks (OTB99, TNL2K, LaSOT, and LaSOText) demonstrate that MAGTrack consistently improves both first-frame grounding and long-term tracking accuracy, achieving state-of-the-art performance under the language-only setting.
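To make the described pipeline concrete, the following is a minimal sketch of how the two plug-and-play modules might compose at inference time. All class, function, and prompt names (GroundingModule, TextRefinementModule, query_mllm, parse_box) are illustrative assumptions based only on the abstract, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the MAGTrack inference loop described in the abstract.
# Every name below is a placeholder assumption, not the authors' code.

from dataclasses import dataclass

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float

def query_mllm(prompt: str, image) -> str:
    """Placeholder for a call to a frozen MLLM (no additional training)."""
    raise NotImplementedError

def parse_box(answer: str) -> Box:
    """Parse an assumed 'x,y,w,h' textual answer into a Box."""
    x, y, w, h = (float(v) for v in answer.split(","))
    return Box(x, y, w, h)

class GroundingModule:
    """'MGM': first-frame target localization via MLLM reasoning."""
    def locate(self, frame, description: str) -> Box:
        answer = query_mllm(
            f"Locate the object described as: '{description}'. "
            "Return its bounding box as x,y,w,h.", frame)
        return parse_box(answer)

class TextRefinementModule:
    """'TRM': update the description from visual context and history."""
    def refine(self, frame, description: str, history: list[Box]) -> str:
        return query_mllm(
            f"The target was described as: '{description}'. Given the "
            "current frame and its recent trajectory, rewrite the "
            "description to match the target's present appearance.", frame)

def track(frames, description: str, tracker) -> list[Box]:
    """Language-only tracking: ground on frame 0, then refine and track."""
    mgm, trm = GroundingModule(), TextRefinementModule()
    box = mgm.locate(frames[0], description)  # first-frame grounding
    history = [box]
    for frame in frames[1:]:
        description = trm.refine(frame, description, history)
        box = tracker.step(frame, box, description)  # any language-aware tracker
        history.append(box)
    return history
```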
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12466