Robust Multi-Stage Tracking via Multi-Scale and Multi-Level Representation Learning

Published: 2025, Last Modified: 04 Nov 2025. Venue: IEEE Trans. Multim. 2025. License: CC BY-SA 4.0
Abstract: Learning multi-scale and multi-level representations is crucial for robust tracking. However, most current one-stream trackers built on visual transformers (ViTs) cannot effectively capture multi-scale representations because the ViTs they adopt are non-hierarchical. Moreover, they often predict results from the output features of the final layer only, ignoring the low-level features of the shallow layers, which limits their multi-level representation learning ability. To address these issues, we propose a robust multi-stage tracker whose backbone combines the advantages of hierarchical and one-stream ViTs to improve both multi-scale and multi-level representation learning. Specifically, we first design a hierarchical tracker with a three-stage backbone. In the first two stages, a dual-branch structure extracts multi-scale features of the template and the search region separately. In particular, we design local scale awareness modules based on simple MLP layers to capture multi-scale features. These modules dispense with complex operations such as convolutions or shifted-window attention, thus avoiding the performance degradation caused by traditional hierarchical ViTs. In the third stage (i.e., the main stage), we construct a global encoder based on the one-stream ViT to achieve efficient feature extraction and feature interaction. We then design a multi-level feature integration module in the main stage that explicitly uses the representations learned by the shallow layers and fuses them with the features of the final layer to obtain multi-level representations. Benefiting from these designs, our tracker captures richer multi-scale and multi-level representations for robust tracking. Comprehensive experiments on the GOT-10k, LaSOT, LaSOT$_{ext}$, TNL2K, UAV123, TrackingNet, and VOT2020 benchmarks validate the effectiveness and robustness of our method.
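
The abstract does not specify how the MLP-based scale modules or the multi-level integration module are built, so the following is only an illustrative sketch of the two general ideas, not the authors' implementation. It assumes (hypothetically) that a scale step merges each 2x2 neighbourhood of tokens with a single linear layer, and that multi-level fusion is a softmax-weighted sum of per-layer features. The names `mlp_downsample` and `fuse_levels`, their parameters, and both design choices are assumptions made for illustration.

```python
import numpy as np


def mlp_downsample(tokens, grid, weight, bias):
    """Merge each 2x2 block of tokens with one linear layer (hypothetical design).

    tokens: (H*W, C) features laid out row-major on an H x W grid (H, W even).
    weight: (4*C, C_out); bias: (C_out,). Returns ((H/2)*(W/2), C_out).
    """
    h, w = grid
    c = tokens.shape[1]
    x = tokens.reshape(h // 2, 2, w // 2, 2, c)        # split the grid into 2x2 blocks
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, 4 * c)  # concatenate the 4 tokens per block
    return x @ weight + bias                           # no convolution, no attention


def fuse_levels(layer_outputs, logits):
    """Softmax-weighted sum of per-layer features (hypothetical fusion rule).

    layer_outputs: list of L arrays, each (N, C), ordered shallow to deep.
    logits: (L,) mixing scores, which would be learned in a real model.
    """
    stacked = np.stack(layer_outputs)          # (L, N, C)
    w = np.exp(logits - logits.max())          # numerically stable softmax
    w = w / w.sum()
    return np.tensordot(w, stacked, axes=1)    # (N, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))                   # 4x4 grid, 8 channels
    weight = rng.normal(size=(32, 16))
    coarse = mlp_downsample(tokens, (4, 4), weight, np.zeros(16))
    fused = fuse_levels([coarse, coarse * 2.0], np.zeros(2))
    print(coarse.shape, fused.shape)
```

With equal logits, `fuse_levels` reduces to a plain average of the layers; unequal logits would let a model weight shallow and deep features differently.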