Abstract: Siamese trackers based on deep networks have achieved impressive performance. However, the appearance of a target often varies due to deformation, fast motion, and other factors. In this work, we propose a multi-head contrastive network, which constructs a separate embedding space for each layer of the backbone, so that the tracker learns representations that are invariant to the different appearances of the same target. Further, we propose a global contextual consistency loss, which maintains consistency not only in semantic information but also in the spatial and channel relationships of the feature representations. Finally, we perform an ablation study to demonstrate the effectiveness of the proposed multi-head contrastive network. We achieve state-of-the-art performance on five benchmark datasets: OTB-100, VOT2016, VOT2018, UAV123 and GOT10K. In particular, we achieve an AUC score of 70.9% on OTB-100.
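The abstract gives no implementation details, but the per-layer embedding idea can be illustrated. Below is a minimal PyTorch sketch, assuming one pooled projection head per backbone layer and an InfoNCE-style contrastive objective between two views of the same target; the names (`MultiHeadContrastive`, `info_nce`), channel sizes, and temperature are illustrative assumptions, not the authors' code, and the global contextual consistency loss is not sketched here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadContrastive(nn.Module):
    """Hypothetical sketch: one projection head per backbone layer,
    each mapping that layer's features into its own embedding space."""
    def __init__(self, layer_channels=(256, 512, 1024), embed_dim=128):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),  # pool spatial map to one vector
                nn.Flatten(),
                nn.Linear(c, embed_dim),  # layer-specific embedding space
            )
            for c in layer_channels
        ])

    def forward(self, feats):
        # feats: list of per-layer feature maps, each [B, C_i, H_i, W_i]
        return [F.normalize(h(f), dim=1) for h, f in zip(self.heads, feats)]

def info_nce(z1, z2, tau=0.1):
    """InfoNCE-style loss: matching rows of z1/z2 are positives,
    all other pairs in the batch are negatives (assumed objective)."""
    logits = z1 @ z2.t() / tau  # [B, B] cosine-similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# Usage: two views of the same target crops, with per-layer backbone
# features assumed to be given; the per-layer losses are summed.
heads = MultiHeadContrastive()
feats_a = [torch.randn(8, c, 16, 16) for c in (256, 512, 1024)]
feats_b = [torch.randn(8, c, 16, 16) for c in (256, 512, 1024)]
loss = sum(info_nce(za, zb) for za, zb in zip(heads(feats_a), heads(feats_b)))
```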