Abstract: Visual mask learning has received increasing attention in the field of visual object tracking. However, most existing studies merely use visual mask learning as a pre-training scheme without fully exploiting its potential for visual representation. In this paper, we present a novel approach for learning tracking target features that leverages an encoder-decoder architecture with masked mutual guidance tracking (MMG). We first perform joint visual feature extraction on the template and search regions. These features then undergo separate self-decoding, followed by mutual guidance decoding that reconstructs the original search and template images. This process fosters mutual understanding between the two images, enabling the model to better learn object states and shapes across frames. During inference, we discard the decoder and deploy a simple yet effective tracker. Experimental results demonstrate the effectiveness of the proposed method: the mutual guidance strategy achieves state-of-the-art performance on five tracking datasets.
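To make the training-time pipeline concrete, here is a minimal PyTorch sketch of the flow the abstract describes: joint encoding of masked template and search tokens, separate self-decoding per branch, and mutual guidance decoding via cross-attention. All module names, layer counts, the masking ratio, and the reconstruction head are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MMGSketch(nn.Module):
    """Hypothetical training-time model: joint encoding -> self-decoding
    -> mutual guidance decoding. Names and sizes are assumptions."""
    def __init__(self, dim=256, patch=16, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patchify images
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)     # shared joint encoder
        self.self_dec_t = nn.TransformerEncoder(enc, num_layers=2)  # template self-decoding
        self.self_dec_s = nn.TransformerEncoder(enc, num_layers=2)  # search self-decoding
        self.mutual_dec = nn.TransformerDecoder(dec, num_layers=2)  # cross-attention guidance
        self.head = nn.Linear(dim, patch * patch * 3)               # per-token pixel head

    def tokens(self, img):
        x = self.embed(img)                  # (B, C, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, C)

    def random_mask(self, x):
        # Keep a random subset of tokens; the rest are dropped (masked).
        B, N, C = x.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :keep]
        return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, C))

    def forward(self, template, search):
        zt = self.random_mask(self.tokens(template))
        zs = self.random_mask(self.tokens(search))
        joint = self.encoder(torch.cat([zt, zs], dim=1))  # joint feature extraction
        ft, fs = joint[:, :zt.size(1)], joint[:, zt.size(1):]
        ft, fs = self.self_dec_t(ft), self.self_dec_s(fs)  # separate self-decoding
        # Mutual guidance: each branch cross-attends to the other's features.
        rec_t = self.head(self.mutual_dec(ft, fs))  # template guided by search
        rec_s = self.head(self.mutual_dec(fs, ft))  # search guided by template
        # For brevity this sketch reconstructs only visible tokens; a full
        # masked-modeling setup would insert mask tokens for dropped patches.
        return rec_t, rec_s

model = MMGSketch()
t = torch.randn(2, 3, 128, 128)  # template crop
s = torch.randn(2, 3, 256, 256)  # search crop
rec_t, rec_s = model(t, s)       # reconstruction targets for the training loss
```

At inference, consistent with the abstract, the decoders and reconstruction head would be dropped and only the joint encoder retained to produce features for the tracking head.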