High-Performance Discriminative Tracking with Spatio-Temporal Template Fusion
Abstract: The current one-stream tracking framework has attracted wide
attention for its significant improvement in tracking performance,
yet it is essentially an extension of Siamese trackers. By contrast, a
one-stream framework for discriminative trackers has not been effectively
exploited: such trackers still perform feature extraction and model
prediction separately. Therefore, this article aims to implement a one-stream
learning strategy for feature extraction and model prediction under
the discriminative tracking framework. To this end, we build on
the prevailing Vision Transformer and Vision Mamba
backbones. Moreover, we
combine templates with discriminative tracking methods to strengthen
target-aware feature learning, and further propose
an attention fusion module that performs spatiotemporal
template fusion, improving the adaptability of the tracking
model to dynamic changes of the target. Experiments on multiple
popular tracking benchmarks demonstrate that our proposed
tracking architecture achieves superior tracking performance. In particular,
our tracker obtains an AUC of 73.3% on the LaSOT dataset and an AO
of 78.2% on the GOT-10k dataset. The code, raw results, and trained
models are available at https://github.com/hexdjx/VisTrack.