Abstract: One-stream transformer trackers have received widespread attention for their excellent discriminative ability. However, most existing trackers focus on mining more information about the target while neglecting the background around it. In this study, we present a one-stream target–background interaction modeling transformer for object tracking. First, to mitigate the effect of interclass variability, a transformer-based target–background interaction model is proposed. This model performs multiview spatiotemporal attention modeling from 2D to 3D, fully exploring the relational dependencies between the target context and the search region. It maximizes the acquisition of high-quality tokens for target representation by minimizing the influence of pure background tokens. Second, to handle intraclass distractors that closely resemble the target, a progressive state-aware module is designed to optimize the feature structure used to represent the target. This module aggregates the target's historical state into the self-attention mechanism, which learns the positional relationship between the target and the distractors in the current frame and adaptively increases the weight of the target. By jointly learning interclass variability and intraclass similarity, our method more accurately reveals the composition and structure of target features, improving visual perception in complex scenes. Extensive experiments on seven benchmarks, with comparisons against existing state-of-the-art trackers, demonstrate the effectiveness of the proposed method.
External IDs: dblp:journals/kbs/ZhangFQZWW25