Abstract: While recent large models have greatly improved tracking performance, not all scenes require a large and complex network. Dynamic networks can adapt their architecture to different inputs, achieving notable gains in both accuracy and computational efficiency. However, existing dynamic architectures and decision mechanisms designed for classification are not applicable to the tracking task. This paper proposes a dynamic tracking framework based on scene perception, named DynamicTrack. We classify tracking scenes into easy and hard categories and propose a dynamic architecture with an easy-hard dual branch to handle the two scene types. Unlike previous classification works that selectively execute only a subset of the backbone, tracking requires complete execution of the entire backbone. Hence, we maintain two complete transformer backbones for the dual branches and vary the number of input tokens to achieve modeling at different granularities. We then propose a scene router that automatically selects the optimal branch for each input frame. The router directly assesses the scene complexity of features extracted by the easy branch for decision-making, without relying on the tracking head output, which improves decision efficiency during dynamic inference. Moreover, we introduce two techniques that benefit DynamicTrack optimization: the Gumbel-Softmax trick and cross-branch transmission (CBT). The former increases the stochasticity of decisions and prevents mode collapse into trivial solutions; the latter establishes information transmission between the two branches, improving discriminative power and learning efficiency. Extensive experiments on four benchmarks demonstrate that the proposed DynamicTrack achieves state-of-the-art performance and accuracy-speed trade-offs.
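The routing mechanism described above can be illustrated with a minimal sketch. This is a hypothetical implementation, not the authors' code: the feature dimension, pooling scheme, and module names are assumptions. It shows the core idea of scoring scene complexity from easy-branch features and using the Gumbel-Softmax trick for a stochastic, differentiable branch choice during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneRouter(nn.Module):
    """Hypothetical scene router sketch: scores scene complexity from
    easy-branch token features and selects the easy or hard branch."""

    def __init__(self, feat_dim=256, tau=1.0):
        super().__init__()
        self.tau = tau                      # Gumbel-Softmax temperature
        self.fc = nn.Linear(feat_dim, 2)    # logits for {easy, hard}

    def forward(self, easy_feats):
        # easy_feats: (batch, num_tokens, feat_dim) from the easy branch.
        # Pool tokens to one complexity descriptor per frame.
        logits = self.fc(easy_feats.mean(dim=1))
        if self.training:
            # Stochastic, differentiable one-hot choice; the injected
            # Gumbel noise discourages collapse into always picking
            # the same branch (a trivial solution).
            return F.gumbel_softmax(logits, tau=self.tau, hard=True)
        # Deterministic argmax at inference time.
        return F.one_hot(logits.argmax(dim=-1), num_classes=2).float()

router = SceneRouter()
router.train()
choice = router(torch.randn(4, 64, 256))  # 4 frames, 64 tokens each
print(choice.shape)  # one one-hot branch decision per frame
```

At inference the sampled decision is replaced by an argmax, so each frame is routed through exactly one of the two complete backbones.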