Abstract: Transformer architectures have shown great strength in visual object tracking, owing to their effective attention mechanism. Existing transformer-based approaches adopt a pixel-to-pixel attention strategy on flattened image features and unavoidably ignore the integrity of objects. In this paper, we propose a new transformer architecture with multi-scale cyclic shifting window attention for visual object tracking, elevating attention from the pixel level to the window level. The cross-window multi-scale attention aggregates attention at different scales and generates the best fine-scale match for the target object. Furthermore, the cyclic shifting strategy improves accuracy by expanding the window samples with positional information, and at the same time saves a large amount of computation by removing redundant calculations. Extensive experiments demonstrate the superior performance of our method, which sets new state-of-the-art records on five challenging benchmarks: VOT2020, UAV123, LaSOT, TrackingNet, and GOT-10k. Our project is available at https://github.com/SkyeSong38/CSWinTT.
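As a rough illustration of what "cyclic shifting" of a window means, the sketch below enumerates every wrap-around shift of a small 2D window, so that one window yields several samples with different positional layouts. It is a minimal sketch in plain Python and not the authors' implementation: the names `cyclic_shift` and `all_cyclic_shifts` are hypothetical, and the attention computation and the removal of redundant calculations described in the abstract are not modeled here.

```python
from typing import List

Window = List[List[int]]


def cyclic_shift(window: Window, dy: int, dx: int) -> Window:
    """Roll a 2D window down by dy rows and right by dx columns, wrapping around.

    Hypothetical helper for illustration only.
    """
    h, w = len(window), len(window[0])
    return [
        [window[(r - dy) % h][(c - dx) % w] for c in range(w)]
        for r in range(h)
    ]


def all_cyclic_shifts(window: Window) -> List[Window]:
    """Return all h * w wrap-around shifts of the window (the first is the original)."""
    h, w = len(window), len(window[0])
    return [cyclic_shift(window, dy, dx) for dy in range(h) for dx in range(w)]


# A toy 2x2 "window" of feature indices.
window = [[1, 2], [3, 4]]
shifts = all_cyclic_shifts(window)
```

Each shifted copy contains the same elements as the original window in a different spatial arrangement, which is one way to read the abstract's phrase "expanding the window samples with positional information".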