Continuously Learning Video-level Object Tokens for Robust UAV Tracking

Published: 01 Jan 2025, Last Modified: 16 Jul 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Due to dynamic changes in flight motion and viewpoint, objects in unmanned aerial vehicle (UAV) tracking scenarios often undergo drastic appearance variations. Existing UAV trackers typically rely on a frame-level matching mechanism that measures the appearance similarity between the object template and the search frame; drastic appearance variations degrade the learned model and lead to the drift issue. To this end, this paper presents a video-level UAV tracking framework, dubbed CLTrack, that focuses on Continuously Learning (CL) effective and efficient spatio-temporal object tokens for robust tracking. Specifically, CLTrack first learns a series of spatio-temporal object tokens via a dynamic filtering module (DFM), which encodes consensus object appearance information from each frame. A spatio-temporal enhancement module (STEM) then cascades temporal and spatial attention so that the selected tokens fully interact with stable long-range spatio-temporal context of the tracked object. Finally, to ensure the learned model encodes rich context information without catastrophic forgetting, a video-level tracking loss supervises feature learning over the whole video sequence. Extensive experiments on three UAV benchmarks, UAV123, DTB70, and VisDrone2018, demonstrate that CLTrack achieves state-of-the-art performance.
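The abstract names two mechanisms without implementation detail: the DFM, which selects consensus object tokens per frame, and the STEM, which cascades temporal and spatial attention over the selected tokens. The following is a minimal NumPy sketch of these two ideas under stated assumptions, not the paper's actual method: the cosine-similarity top-k criterion for token selection, the token shapes, and the use of plain (projection-free) dot-product attention are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Plain scaled dot-product self-attention (no learned projections),
    # applied over the second-to-last axis of x.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def select_tokens(frame_tokens, template, k):
    # Hypothetical stand-in for the DFM: keep the k tokens per frame
    # most similar (cosine) to a template embedding.
    sim = frame_tokens @ template / (
        np.linalg.norm(frame_tokens, axis=-1) * np.linalg.norm(template) + 1e-8)
    idx = np.argsort(-sim, axis=-1)[:, :k]            # (T, k)
    return np.take_along_axis(frame_tokens, idx[..., None], axis=1)

def stem(tokens):
    # Cascaded temporal-then-spatial attention over tokens of shape
    # (T, N, D): T frames, N tokens per frame, D channels.
    t_in = tokens.transpose(1, 0, 2)                  # (N, T, D): attend across frames
    x = tokens + attention(t_in).transpose(1, 0, 2)   # temporal stage + residual
    return x + attention(x)                           # spatial stage: across tokens per frame

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16, 32))             # 8 frames, 16 tokens, dim 32
template = rng.standard_normal(32)
selected = select_tokens(frames, template, k=4)       # (8, 4, 32)
out = stem(selected)
print(out.shape)                                      # (8, 4, 32)
```

The temporal-first ordering mirrors the cascade described in the abstract: tokens first aggregate long-range context across frames, and the enriched tokens then interact spatially within each frame.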