Keywords: Object tracking, Visual-Language tracking, Single object tracking, Visual tracking
TL;DR: ATSTrack is the first Visual-Language tracker to resolve the temporal-spatial misalignment between visual and language inputs through fine-grained attribute-based feature alignment and a cross-modal token mechanism, achieving state-of-the-art performance.
Abstract: A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature-modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability: the inherent difference in the temporal and spatial scale of information between visual and language inputs. To address this issue, we propose a novel visual-language tracker, named ATSTrack, that enhances the effect of feature modification by Aligning the Temporal and Spatial scales of different input components. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with the visual inputs, and modify their features in a fine-grained manner. Moreover, we introduce a Visual-Language token that carries modified linguistic information from the previous frame to guide the model to extract visual features more relevant to the language description, thereby reducing the impact of the difference in spatial scale. Experimental results show that our proposed ATSTrack achieves performance comparable to existing methods. Our code will be released.
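A minimal sketch of how the Visual-Language token described in the abstract could be realized; this is our own PyTorch illustration under assumed module names and dimensions (`VLTokenGuidedEncoder`, `dim=256`, a pooled language feature from frame t-1), not the authors' implementation. A learnable token is updated with the previous frame's modified linguistic feature and prepended to the visual tokens, so joint self-attention steers visual feature extraction toward language-relevant content.

```python
import torch
import torch.nn as nn


class VLTokenGuidedEncoder(nn.Module):
    """Hypothetical sketch of a Visual-Language token mechanism:
    a learnable token, conditioned on linguistic features from the
    previous frame, attends jointly with visual tokens so language
    cues guide visual feature extraction."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vl_token = nn.Parameter(torch.zeros(1, 1, dim))  # assumed init
        self.lang_proj = nn.Linear(dim, dim)  # projects modified language feature
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, prev_lang_feat: torch.Tensor):
        # visual_tokens: (B, N, dim); prev_lang_feat: (B, dim) from frame t-1
        b = visual_tokens.size(0)
        vl = self.vl_token.expand(b, -1, -1) + self.lang_proj(prev_lang_feat).unsqueeze(1)
        x = torch.cat([vl, visual_tokens], dim=1)  # prepend the VL token
        out, _ = self.attn(x, x, x)                # joint self-attention
        out = self.norm(out + x)                   # residual + norm
        # updated VL token (carried to the next frame), language-guided visual features
        return out[:, 0], out[:, 1:]


if __name__ == "__main__":
    enc = VLTokenGuidedEncoder()
    vis = torch.randn(2, 196, 256)   # e.g. 14x14 patch tokens
    lang = torch.randn(2, 256)       # pooled language feature from frame t-1
    vl_tok, guided = enc(vis, lang)
    print(vl_tok.shape, guided.shape)  # (2, 256), (2, 196, 256)
```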
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6021