Abstract: Natural language tracking aims to locate a target in a video based on a language description. The rich contextual information in the language and the video sequence is essential for describing the target's movements and appearance variations. However, existing natural language trackers design a fixed-length memory to store historical target information, which exploits only limited contextual information and requires manually designed modules, resulting in sub-optimal localization performance and high computational cost. Inspired by the success of state space models, we propose a novel Context-adaptive Mamba Tracker (CMTrack). It offers several merits. First, we propose a novel context-aware state space model that enables language features to serve as hidden states that interact adaptively with relevant image features. Second, CMTrack transfers the hidden states frame by frame to continuously incorporate contextual target information into the language features, yielding context-adaptive language cues. Third, these context-adaptive language cues effectively capture the long-range behavior of the target and guide the tracker to locate the target accurately without any extra hand-crafted modules. Finally, CMTrack provides a neat pipeline for training and tracking with linear complexity. Experimental results demonstrate that CMTrack achieves new state-of-the-art performance.
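The core idea of carrying language features as a hidden state that is updated frame by frame can be sketched as a simple linear state space recurrence. The sketch below is illustrative only: the matrices `A` and `B`, the feature dimension, and the random features are hypothetical stand-ins for the learned, input-dependent parameters of the actual context-aware state space model described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical feature dimension

# Hypothetical fixed parameters of a linear state space update
# (the real model learns input-dependent versions of these).
A = np.eye(d) * 0.9                     # state transition: gradual forgetting
B = rng.standard_normal((d, d)) * 0.1   # projection of per-frame image features

def update_state(h, x):
    """One recurrent step: carry the language-derived hidden state h
    across frames, mixing in the current frame's image features x.
    Each step is a fixed amount of work, so a video of T frames
    costs O(T) overall (linear complexity)."""
    return A @ h + B @ x

# The language description is encoded once and initializes the hidden state.
language_cue = rng.standard_normal(d)
h = language_cue.copy()

# The hidden state is updated frame by frame, accumulating target context.
for t in range(10):
    frame_features = rng.standard_normal(d)  # stand-in for encoded frame t
    h = update_state(h, frame_features)

# h now acts as a context-adaptive language cue for localizing the target.
print(h.shape)
```

After the loop, `h` blends the original language cue with accumulated per-frame context, which is the intuition behind the "context-adaptive language cues" the abstract describes.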
External IDs: doi:10.1109/tcsvt.2025.3623281