Abstract: Recent transformer-based visual tracking models have demonstrated superior performance. However, prior works are resource-intensive, demanding massive amounts of GPU hours for training, which renders them unsuitable for real-world applications. In this paper, we present DETRack, a training-friendly visual object tracking framework that learns to integrate motion priors. Our framework employs an efficient encoder-decoder structure, with a deformable transformer decoder serving as the target head. We introduce a denoising training strategy that simulates historical predictions and enriches the supervision signal during training. Comprehensive experiments confirm the effectiveness and efficiency of our method. Notably, DETRack takes only 11 hours to train on a single RTX 2080 Ti, achieving performance comparable to advanced trackers on multiple benchmarks while maintaining a high running speed.
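To make the denoising training strategy concrete, the following is a minimal PyTorch-style sketch of the general idea: ground-truth boxes are jittered to simulate imperfect historical predictions, and the resulting noised boxes serve as extra decoder queries that are supervised to recover the clean boxes. All names and noise parameters here (e.g., make_noisy_queries, center_noise, scale_noise) are illustrative assumptions, not details taken from the DETRack implementation.

```python
import torch

def make_noisy_queries(gt_boxes: torch.Tensor,
                       num_noised: int = 4,
                       center_noise: float = 0.1,
                       scale_noise: float = 0.2) -> torch.Tensor:
    """Simulate historical predictions by jittering ground-truth boxes.

    gt_boxes: (B, 4) boxes in normalized (cx, cy, w, h) format.
    Returns:  (B, num_noised, 4) noised boxes used as decoder queries.
    (Hypothetical helper; parameters are illustrative, not from the paper.)
    """
    boxes = gt_boxes.unsqueeze(1).repeat(1, num_noised, 1)  # (B, K, 4)
    # Jitter box centers proportionally to box width/height.
    boxes[..., :2] += (torch.rand_like(boxes[..., :2]) * 2 - 1) \
        * center_noise * boxes[..., 2:]
    # Jitter width/height multiplicatively.
    boxes[..., 2:] *= 1 + (torch.rand_like(boxes[..., 2:]) * 2 - 1) * scale_noise
    return boxes.clamp(0, 1)

# In training, these noised boxes would act as reference points for the
# deformable decoder, with an auxiliary regression loss pulling each
# noised query back to the clean ground-truth box, thereby enriching
# the supervision signal as the abstract describes.
```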