Exploring Pruning-based Efficient Object Tracking via Hybrid Knowledge Distillation

Published: 11 Sept 2025, Last Modified: 27 Jan 2026 | IEEE Transactions on Circuits and Systems for Video Technology | Everyone | CC BY 4.0
Abstract: Transformer-based trackers have made remarkable advances in real-time tracking on edge devices. Because lightweight backbone networks are typically designed for general-purpose tasks, they often contain structurally redundant layers when applied to target tracking, as our analysis reveals, which limits the model's efficiency. To address this issue, we propose a novel tracking framework that integrates backbone pruning with Hybrid Knowledge Distillation (HKD), reducing model parameters and FLOPs while preserving high tracking accuracy. Inspired by the success of MiniLM and Focal and Global Distillation (FGD), we design an HKD framework tailored for tracking tasks. Our HKD introduces a multi-level, complementary distillation scheme consisting of Token Distillation, Local Distillation, and Global Distillation. In Token Distillation, unlike MiniLM, which distills attention via QK dot products and V, we disentangle and separately distill the Q, K, and V representations to enhance structural attention alignment for tracking. For Local Distillation, we adopt the FGD concept by incorporating spatial foreground-background masks to capture region-specific discriminative cues more effectively. In Global Distillation, we use a Vision Mamba module to model long-range dependencies and enhance semantic-level feature alignment. Our tracker, HKDT, achieves state-of-the-art (SOTA) performance across multiple datasets. On the GOT-10k benchmark, it reaches 67.6% Average Overlap (AO), outperforming the current SOTA real-time tracker HiT-Base by 3.6% in accuracy while reducing computational costs by 64% and tracking 115% faster on CPU platforms. The code and model will be available soon.
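As an illustration only, the Token and Local Distillation terms described in the abstract might be sketched as follows. The abstract does not give the loss formulas, so everything here is an assumption: the use of mean squared error, the function names (`token_distill_loss`, `local_distill_loss`), the foreground and background weights, and the plain-list matrix layout are all hypothetical. The Vision Mamba based Global Distillation term is omitted.

```python
from typing import Dict, List

# Rows are tokens (or spatial positions), columns are channels.
Matrix = List[List[float]]


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error over all entries of two same-shaped matrices."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def token_distill_loss(student: Dict[str, Matrix],
                       teacher: Dict[str, Matrix]) -> float:
    """Match Q, K and V separately and sum the three terms.

    Assumption: each term is a plain MSE; the paper's actual
    per-representation loss is not stated in the abstract.
    """
    return sum(mse(student[name], teacher[name]) for name in ("q", "k", "v"))


def local_distill_loss(student: Matrix, teacher: Matrix, fg_mask: List[int],
                       fg_weight: float = 2.0, bg_weight: float = 1.0) -> float:
    """Feature-matching loss weighted by a foreground/background mask.

    fg_mask[i] is 1 if position i lies on the target, else 0.
    The weights are hypothetical defaults, not values from the paper.
    """
    total = 0.0
    for s_row, t_row, is_fg in zip(student, teacher, fg_mask):
        weight = fg_weight if is_fg else bg_weight
        err = sum((x - y) ** 2 for x, y in zip(s_row, t_row)) / len(s_row)
        total += weight * err
    return total / len(fg_mask)


# Toy example: two tokens with two channels each.
teacher_qkv = {name: [[1.0, 0.0], [0.0, 1.0]] for name in ("q", "k", "v")}
student_qkv = {
    "q": [[0.0, 0.0], [0.0, 1.0]],  # differs from the teacher in one entry
    "k": [[1.0, 0.0], [0.0, 1.0]],
    "v": [[1.0, 0.0], [0.0, 1.0]],
}
token_loss = token_distill_loss(student_qkv, teacher_qkv)
local_loss = local_distill_loss([[1.0], [1.0]], [[0.0], [0.0]], fg_mask=[1, 0])
```

Keeping the Q, K and V terms separate means a mismatch in any one representation is penalized directly, rather than only through the attention map it produces; the mask weighting likewise lets errors on the target region count more than errors on the background.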