Abstract: High-performance video object tracking is pivotal for video comprehension and analysis. Consecutive video frames carry strong temporal correlations, yet current methods fail to exploit this temporal information effectively, leading to inaccurate feature representations of visual targets and a heightened risk of tracking failure. To tackle this issue, we introduce a novel transformer-based tracker, dubbed the Video Temporal-Spatial Features and Long-term Memory (TSFLM) tracker. First, the encoder stacks multiple self-attention modules to extract spatial and temporal features, respectively. Second, we design a novel continuous template-update module that preserves a long-term memory of the target template. Third, we employ the long-term memory template to further enrich the feature representation of the input (search) frame. Finally, the tracking results are produced by the decoder. Extensive comparative experiments on multiple challenging benchmarks demonstrate that our tracker achieves state-of-the-art performance compared with strong baselines. The source code will be available at https://github.com/SYLan2019/TSFLM-Tracker.
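The continuous template update with long-term memory mentioned above can be illustrated with a minimal sketch. The snippet below assumes an exponential-moving-average blend of each new template feature into a stored memory vector, which is a common way to maintain long-term target appearance; the class name, `momentum` parameter, and update rule are illustrative assumptions, not the exact TSFLM mechanism.

```python
class LongTermTemplateMemory:
    """Toy sketch of a continuous template-update module.

    Maintains a long-term memory of the target template by blending
    each newly extracted template feature into the stored one via an
    exponential moving average. Illustrative assumption only; the
    actual TSFLM update rule may differ.
    """

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum  # weight given to the stored memory
        self.memory = None        # long-term template feature vector

    def update(self, template):
        """Blend a new template feature vector into the memory."""
        if self.memory is None:
            # First frame: initialize memory with the initial template.
            self.memory = list(template)
        else:
            # EMA blend: keep most of the memory, mix in the new template.
            self.memory = [
                self.momentum * m + (1.0 - self.momentum) * t
                for m, t in zip(self.memory, template)
            ]
        return self.memory
```

At inference time, the memorized template would then be used to augment the search-frame features before decoding, as described in the abstract.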