Abstract: Multiple Object Tracking (MOT) is a fundamental computer vision task focused on identifying and monitoring the movement of multiple objects within a video sequence. MOT plays a crucial role in applications such as surveillance, autonomous driving, and human-computer interaction. The primary objective is to follow the trajectories of individual objects consistently and accurately across frames despite challenges such as occlusions and varying appearances. This paper presents a deep learning approach to tracking objects of multiple categories that enriches contextual features during training. We address the complexity of tracking by using temporal similarity to improve the features of consecutive frames. To further enhance the performance of our tracking framework, we introduce a training strategy that augments the original dataset with a sub-dataset in which large input images are divided into sub-patches. The approach achieves competitive results on the MOT dataset, with a tracking accuracy of 72.4% and a tracking precision of 81.6%.
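To make the sub-patch idea concrete, the following is a minimal sketch of how a large frame might be divided into a grid of sub-patches. The abstract does not state the patch size, whether patches overlap, or how annotations are remapped, so the function name `patch_boxes`, the non-overlapping grid, and the clipping of edge patches are all illustrative assumptions rather than the method used in the paper.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def patch_boxes(width: int, height: int, patch_w: int, patch_h: int) -> List[Box]:
    """Return crop boxes that tile a width x height image in row-major order.

    Hypothetical helper: assumes a non-overlapping grid and clips the last
    row and column of patches to the image border instead of padding.
    """
    if min(width, height, patch_w, patch_h) <= 0:
        raise ValueError("all sizes must be positive")
    boxes: List[Box] = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            right = min(left + patch_w, width)
            bottom = min(top + patch_h, height)
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # Split a 1920x1080 frame into four 960x540 sub-patches.
    for box in patch_boxes(1920, 1080, 960, 540):
        print(box)
```

Each box can be used to crop the frame, and subtracting a patch's `left` and `top` values from the ground-truth bounding boxes that fall inside it gives annotations in the coordinates of that sub-patch.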