Abstract: Object tracking is a well-established task in the field of computer vision. Despite decades of effort, it remains a formidable challenge, largely due to the complexity of the scenes encountered in video sequences. Leveraging the powerful capabilities of Transformers for object feature extraction and fusion, and drawing inspiration from CornerNet, this paper introduces a new Transformer-based anchor-free method for object tracking called CorPN (Corner Prediction Network). Specifically, a Transformer-based backbone is proposed to extract and fuse deep features in a Siamese network architecture. Furthermore, our methodology aims to emulate the intuitive cognitive process by which humans identify the bounding-box corners of objects. CorPN proves to be a more effective and robust tracking method because it jointly leverages multi-level feature aggregation, corner prediction, and probabilistic prediction incorporated in the tracking head. In extensive evaluations on public datasets, our method achieves state-of-the-art performance. In particular, it attains an area-under-the-curve (AUC) of 85.2%, a precision score (P) of 84.5%, and a normalized precision score ($P_{norm}$) of 89.6%, outperforming existing state-of-the-art trackers on these metrics.
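To make the corner-prediction idea concrete, the following is a minimal, hypothetical sketch of CornerNet-style probabilistic corner decoding: each corner of the bounding box is recovered from a spatial score map as its softmax-weighted expected position (a soft-argmax). The function names and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def soft_argmax_corner(heatmap):
    """Decode one corner coordinate from an (H, W) map of unnormalized
    scores as the probability-weighted expected position (soft-argmax).
    This is a common way to turn a corner heatmap into (x, y)."""
    h, w = heatmap.shape
    prob = np.exp(heatmap - heatmap.max())
    prob /= prob.sum()                      # softmax over all positions
    ys, xs = np.mgrid[0:h, 0:w]             # per-pixel row/column indices
    return float((prob * xs).sum()), float((prob * ys).sum())

def decode_box(tl_map, br_map):
    """A box is the pair of decoded top-left and bottom-right corners."""
    x1, y1 = soft_argmax_corner(tl_map)
    x2, y2 = soft_argmax_corner(br_map)
    return x1, y1, x2, y2
```

Because the decoding is an expectation over the full score map, a confident (peaked) heatmap yields a near-argmax corner, while a diffuse map yields a smoothed estimate, which is one way a tracking head can express positional uncertainty.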
External IDs: dblp:journals/tce/LiZYDML25