Keywords: Visual tracking
Abstract: Efficient modeling of spatio-temporal representations in videos is crucial for accurate object tracking. Popular one-stream tracking frameworks typically introduce memory mechanisms or specialized modules for temporal modeling; however, because the initial template gradually degrades and target-representation updates are unstable, their performance often deteriorates over time. To address this issue, we propose a simple yet effective video-level tracking framework, STARTrack, which realizes the temporal evolution of target and context representations through an iterative token propagation mechanism. The framework takes as input the search-region features together with two types of tokens that carry historical representations, and employs a visual encoder for joint modeling. This design enables target-aware perception and adaptively fuses current and historical representations. The proposed method avoids sustained reliance on the initial template during long-term tracking, without introducing additional complex context inputs or motion-modeling modules, and thereby achieves faster inference. Furthermore, we develop a training strategy tailored to the framework: it enhances the semantic coherence of target representations over time via a representation consistency constraint, and, for the first time, explicitly incorporates occluded frames into the training process. This guides the tracker to learn context representations that are strongly correlated with the spatio-temporal state of the target, reducing reliance on target appearance alone. Extensive experiments on standard benchmarks demonstrate that STARTrack achieves state-of-the-art performance while maintaining a favorable balance between accuracy and efficiency. The code will be released.
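To make the iterative token propagation concrete, below is a minimal PyTorch sketch of the loop the abstract describes: search-region features are jointly encoded with two propagated tokens carrying historical target and context state. All names here (TokenPropagationTracker, init_target_token, the box head, layer counts) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TokenPropagationTracker(nn.Module):
    """Hypothetical sketch of a video-level tracker that jointly encodes
    search-region features with two propagated tokens (target + context)."""

    def __init__(self, dim=256, depth=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)  # stand-in visual encoder
        # Learned initial states for the two propagated token types.
        self.init_target_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.init_context_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 4)  # toy box-regression head

    def forward(self, search_feats, target_tok, context_tok):
        # search_feats: (B, N, dim) patch features of the current search region.
        tokens = torch.cat([target_tok, context_tok, search_feats], dim=1)
        tokens = self.encoder(tokens)
        # The updated tokens are carried to the next frame, so the target and
        # context representations evolve over time instead of staying anchored
        # to the initial template.
        new_target_tok = tokens[:, :1]
        new_context_tok = tokens[:, 1:2]
        box = self.head(new_target_tok.squeeze(1))
        return box, new_target_tok, new_context_tok

# Toy tracking loop over a 5-frame clip with dummy features.
tracker = TokenPropagationTracker()
B, N, dim = 2, 196, 256
tgt = tracker.init_target_token.expand(B, -1, -1)
ctx = tracker.init_context_token.expand(B, -1, -1)
for frame_feats in torch.randn(5, B, N, dim):
    box, tgt, ctx = tracker(frame_feats, tgt, ctx)
```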
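Similarly, the representation consistency constraint might take a form like the following cosine-similarity sketch over per-frame target-token states; `consistency_loss` and the choice to stop gradients through the earlier token are assumptions for illustration, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def consistency_loss(target_tokens):
    """Hypothetical representation-consistency constraint: encourage the
    target token to stay semantically coherent across adjacent frames.

    target_tokens: list of (B, dim) token states, one per frame.
    """
    loss = torch.tensor(0.0)
    for prev, cur in zip(target_tokens[:-1], target_tokens[1:]):
        # Stop gradients through the earlier state so it acts as the anchor.
        loss = loss + (1.0 - F.cosine_similarity(cur, prev.detach(), dim=-1)).mean()
    return loss / (len(target_tokens) - 1)

# Usage with dummy token states from a 5-frame clip:
tokens = [torch.randn(2, 256) for _ in range(5)]
print(consistency_loss(tokens))
```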
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8301