- Abstract: In this paper, we find that by designing a novel loss function entitled, ''tracking loss'', Convolutional Neural Network (CNN) based object detectors can be successfully converted to well-performed visual trackers without any extra computational cost. This property is preferable to visual tracking where annotated video sequences for training are always absent, because rich features learned by detectors from still images could be utilized by dynamic trackers. It also avoids extra machinery such as feature engineering and feature aggregation proposed in previous studies. Tracking loss achieves this property by exploiting the internal structure of feature maps within the detection network and treating different feature points discriminatively. Such structure allows us to simultaneously consider discrimination quality and bounding box accuracy which is found to be crucial to the success. We also propose a network compression method to accelerate tracking speed without performance reduction. That also verifies tracking loss will remain highly effective even if the network is drastically compressed. Furthermore, if we employ a carefully designed tracking loss ensemble, the tracker would be much more robust and accurate. Evaluation results show that our trackers (including the ensemble tracker and two baseline trackers), outperform all state-of-the-art methods on VOT 2016 Challenge in terms of Expected Average Overlap (EAO) and robustness. We will make the code publicly available.
- TL;DR: We successfully convert a popular detector RPN to a well-performed tracker from the viewpoint of loss function.
- Keywords: Object detection, Visual Tracking, Loss function, Region Proposal Network, Network compression