Overlapped Trajectory-Enhanced Visual Tracking

Published: 01 Jan 2024 · Last Modified: 12 Jun 2025 · IEEE Trans. Circuits Syst. Video Technol. 2024 · CC BY-SA 4.0
Abstract: Deep-learning-based methods have achieved promising performance in visual tracking tasks. However, the backbones of existing trackers are typically inherited from object detection, which makes them inefficient and insufficient for spatial template matching. Moreover, such trackers exploit temporal information during online inference without verifying its reliability, leaving them prone to error accumulation. To address these two issues, this work proposes OTETrack, a novel visual tracker with overlapped feature extraction and robust trajectory enhancement. The backbone of OTETrack, termed Overlapped ViT, slices the input image into overlapped patches to attain stronger template-matching capability and feeds them into alternating attention modules to maintain high model efficiency. Moreover, the trajectory enhancement mechanism in OTETrack predicts the center of a ladder-shaped Hanning window, which mildly penalizes displacements between the spatial tracking results and the temporally predicted results; this maintains tracking consistency across a video sequence and mitigates the influence of spurious temporal information. Extensive experiments on five benchmarks against thirteen baselines demonstrate the state-of-the-art performance of OTETrack. The source code and Appendix are available at https://github.com/OrigamiSL/OTETrack.
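The overlapped slicing in the backbone can be pictured with a short PyTorch sketch: a convolutional stem whose stride is smaller than its kernel yields patch tokens that share pixels with their neighbours, in contrast to a standard ViT where kernel size and stride are equal. The concrete patch size (16), stride (12), and embedding width below are illustrative assumptions rather than the paper's exact configuration, and the alternating attention modules that follow the embedding are not shown.

```python
import torch
import torch.nn as nn

class OverlappedPatchEmbed(nn.Module):
    """Minimal sketch of an overlapped patch embedding.

    A plain ViT sets kernel_size == stride (non-overlapping patches);
    choosing stride < kernel_size makes neighbouring patches overlap,
    which is the basic idea behind the overlapped slicing described in
    the abstract. Sizes here are assumed, not taken from the paper.
    """

    def __init__(self, in_chans=3, embed_dim=768, patch_size=16, stride=12):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=stride,
                              padding=patch_size // 2)

    def forward(self, x):
        x = self.proj(x)                      # (B, C, H', W') feature map
        return x.flatten(2).transpose(1, 2)   # (B, N, C) token sequence

tokens = OverlappedPatchEmbed()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 484, 768]) with the sizes above
```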
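The trajectory enhancement can likewise be sketched as a window penalty on the response map: a Hanning window is shifted so that its peak sits at the trajectory-predicted center, then blended with the raw spatial scores so that detections far from the predicted trajectory are mildly penalised. The paper's ladder-shaped window is approximated here by a plain 2-D Hanning, and the blending weight `win_influence` is an assumed hyperparameter.

```python
import numpy as np

def shifted_hanning_penalty(score_map, center, win_influence=0.3):
    """Sketch of a trajectory-guided Hanning-window penalty.

    score_map : (H, W) raw spatial response of the tracker.
    center    : (row, col) location predicted by the trajectory model.

    A 2-D Hanning window is cropped from an oversized (2H-1, 2W-1)
    window so that its peak lands exactly on `center`, then blended
    with the raw scores. This approximates the ladder-shaped window
    described in the abstract with a standard Hanning profile.
    """
    h, w = score_map.shape
    hann = np.outer(np.hanning(2 * h - 1), np.hanning(2 * w - 1))
    r0 = (h - 1) - center[0]   # crop offsets placing the peak at `center`
    c0 = (w - 1) - center[1]
    window = hann[r0:r0 + h, c0:c0 + w]
    return (1 - win_influence) * score_map + win_influence * window

# Usage: pick the penalised maximum instead of the raw maximum.
score = np.random.rand(22, 22)                       # toy response map
smoothed = shifted_hanning_penalty(score, (10, 12))  # predicted center
row, col = np.unravel_index(smoothed.argmax(), smoothed.shape)
```

Because the window only reweights rather than masks the response, a strong spatial detection far from the predicted trajectory can still win, which is consistent with the abstract's description of a "mild" penalty.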