Abstract: Transformer-based trackers have demonstrated remarkable advancements in real-time tracking tasks on edge devices. Since lightweight backbone networks are typically designed
for general-purpose tasks, our analysis reveals that, when applied
to target tracking, they often contain structurally redundant
layers, which limits the model’s efficiency. To address this issue,
we propose a novel tracking framework that integrates backbone
pruning with Hybrid Knowledge Distillation (HKD), effectively
reducing model parameters and FLOPs while preserving high
tracking accuracy. Inspired by the success of MiniLM and Focal
and Global Distillation (FGD), we design an HKD framework
tailored for tracking tasks. Our HKD introduces a multi-level
and complementary distillation scheme, consisting of Token
Distillation, Local Distillation, and Global Distillation. In Token
Distillation, unlike MiniLM, which distills attention via QK dot-products and V, we disentangle and separately distill Q, K, and
V representations to enhance structural attention alignment for
tracking. For Local Distillation, we follow the FGD concept by
incorporating spatial foreground-background masks to capture
region-specific discriminative cues more effectively. In Global
Distillation, we use a Vision Mamba module to model long-range
dependencies and enhance semantic-level feature alignment. Our
tracker HKDT achieves state-of-the-art (SOTA) performance
across multiple datasets. On the GOT-10k benchmark, it reaches an Average Overlap (AO) of 67.6%, outperforming the current SOTA real-time tracker HiT-Base by 3.6%
in accuracy while reducing computational costs by 64% and
achieving 115% faster tracking speed on CPU platforms. The
code and model will be available soon.
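
As a rough illustration of the Token and Local Distillation terms described above, the following NumPy sketch shows one way such losses could be written. It is not the authors' implementation: the abstract does not give the loss formulas, so the self-relation KL divergence applied separately to Q, K, and V, the masked mean-squared error, the function names, and the weights w_fg and w_bg are all assumptions made for illustration. The sketch also assumes that teacher and student feature maps already share the same shape for the local term.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relation_kl(teacher, student, eps=1e-12):
    """KL(teacher || student) between token-to-token self-relation maps.

    teacher: (N, d_t) and student: (N, d_s) token representations.
    The (N, N) relation maps are comparable even when d_t != d_s.
    This form of the loss is an assumption, not taken from the paper.
    """
    rel_t = softmax(teacher @ teacher.T / np.sqrt(teacher.shape[-1]))
    rel_s = softmax(student @ student.T / np.sqrt(student.shape[-1]))
    kl = rel_t * (np.log(rel_t + eps) - np.log(rel_s + eps))
    return float(kl.sum(axis=-1).mean())


def token_distillation_loss(teacher_qkv, student_qkv):
    """Sum of separate relation losses for Q, K, and V (hypothetical form)."""
    return sum(
        relation_kl(teacher_qkv[name], student_qkv[name])
        for name in ("q", "k", "v")
    )


def local_distillation_loss(feat_t, feat_s, fg_mask, w_fg=1.0, w_bg=0.5):
    """Foreground/background-masked MSE between (C, H, W) feature maps.

    fg_mask is a binary (H, W) mask, 1 inside the target region.
    The weights w_fg and w_bg are illustrative values only.
    """
    sq_err = ((feat_t - feat_s) ** 2).mean(axis=0)  # per-pixel error, (H, W)
    bg_mask = 1.0 - fg_mask
    fg = (sq_err * fg_mask).sum() / max(fg_mask.sum(), 1.0)
    bg = (sq_err * bg_mask).sum() / max(bg_mask.sum(), 1.0)
    return float(w_fg * fg + w_bg * bg)
```

Comparing relation maps rather than raw Q, K, and V tensors is a design choice in this sketch: it lets a pruned student with a smaller hidden size be matched to the teacher without an extra projection layer.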