High-Performance Discriminative Tracking with Spatio-Temporal Template Fusion

Published: 26 Oct 2025, Last Modified: 25 Mar 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0
Abstract: The one-stream tracking framework has attracted wide attention for its significant improvement in tracking performance, yet it is essentially an extension of Siamese trackers. A one-stream framework for discriminative trackers, by contrast, has not been effectively explored: these trackers still perform feature extraction and model prediction separately. This article therefore implements a one-stream learning strategy for joint feature extraction and model prediction under the discriminative tracking framework, built on the prevailing Vision Transformer and Vision Mamba backbones. In addition, we combine templates with discriminative tracking methods to strengthen target-aware feature learning, and we propose an attention fusion module that performs spatio-temporal template fusion, improving the tracking model's adaptability to dynamic changes of the target. Experiments on multiple popular tracking benchmarks demonstrate that the proposed architecture achieves superior tracking performance. In particular, our tracker obtains an AUC of 73.3% on the LaSOT dataset and an AO of 78.2% on the GOT-10k dataset. The code, raw results, and trained models are available at https://github.com/hexdjx/VisTrack.