Abstract: Object tracking has advanced significantly with Transformer-based architectures in recent years. However, replacing convolutional layers with global cross-attention in the tracking head of these architectures results in a loss of object-centric inductive bias. Consequently, existing Transformer-based methods often struggle with complex real-life scenarios, such as low resolution, background clutter, and scale variation. To address this issue, we propose a new Vision Transformer-based anchor-free tracking framework named CasCenter. Specifically, the framework features a cascade attention module in the decoder that propagates tracking cues from the previous tracking head to refine object features in a coarse-to-fine manner, enabling the tracker to focus more effectively on the target. Additionally, to further improve tracking stability and accuracy, we incorporate SIoU loss, a multi-scale tracking head, and a Gaussian mask-constrained cross-attention mechanism that emphasizes target regions while suppressing background interference. Extensive experiments demonstrate the superiority of our proposed CasCenter.
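As a rough illustration of the Gaussian mask-constrained cross-attention mentioned in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation: the feature shapes, the `sigma` parameter, and the log-mask bias on the attention logits are all assumptions made here for clarity.

```python
# Minimal sketch of Gaussian mask-constrained cross-attention (assumed design,
# not the paper's code). A 2D Gaussian centered on the current target estimate
# biases the attention logits so query tokens attend mostly to search-region
# keys near the target while background keys are suppressed.
import torch
import torch.nn.functional as F


def gaussian_mask(h, w, center, sigma):
    """2D Gaussian over an h x w feature grid, centered at (cy, cx)."""
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    cy, cx = center
    return torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))


def masked_cross_attention(q, k, v, mask, dim):
    """q: (B, Nq, C) template tokens; k, v: (B, Nk, C) search tokens;
    mask: (Nk,) flattened Gaussian weights for the search grid."""
    logits = q @ k.transpose(-2, -1) / dim ** 0.5   # (B, Nq, Nk)
    logits = logits + torch.log(mask + 1e-6)        # bias toward target region
    attn = F.softmax(logits, dim=-1)
    return attn @ v                                 # (B, Nq, C)


if __name__ == "__main__":
    B, C, H, W = 2, 256, 16, 16                 # assumed search feature size
    q = torch.randn(B, 64, C)                   # e.g. 8x8 template tokens
    search = torch.randn(B, H * W, C)
    mask = gaussian_mask(H, W, center=(8.0, 8.0), sigma=3.0).flatten()
    out = masked_cross_attention(q, search, search, mask, dim=C)
    print(out.shape)                            # torch.Size([2, 64, 256])
```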
External IDs: dblp:journals/tce/LiZYSZY25