Pedestrian Head Detection and Tracking via Global Vision Transformer

Published: 01 Jan 2022 · Last Modified: 05 Oct 2023 · IW-FCV 2022
Abstract: In recent years, pedestrian detection and tracking have made significant progress in both performance and latency. However, detecting and tracking full pedestrian bodies in highly crowded environments remains a complicated task in computer vision because pedestrians are partly or fully occluded by one another. It requires substantial human annotation effort and complex trackers to identify invisible pedestrians in the spatial and temporal domains. To alleviate these problems, previous methods detect and track only the visible parts of pedestrians (e.g., heads or visible body regions), which achieves remarkable performance and improves the scalability of tracking models and dataset sizes. Motivated by this idea, this paper proposes a simple but effective method to detect and track pedestrian heads in crowded scenes, called PHDTT (Pedestrian Head Detection and Tracking with Transformer). First, a powerful encoder-decoder Transformer network is integrated into the tracker: it learns relations between object queries and global image features to reason about detections in each frame, and it matches object queries with tracked objects across adjacent frames to perform data association, replacing motion prediction, IoU-based, and Re-ID-based methods. Both components form a single end-to-end network, which makes the tracker simpler, more efficient, and more effective. Second, the proposed Transformer-based tracker is evaluated on the challenging CroHD benchmark. Without bells and whistles, PHDTT achieves 60.6 MOTA, outperforming recent methods by a large margin. Testing videos are available at https://bit.ly/3eOPQ2d .
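The abstract describes a query-based design in which a Transformer decoder both detects heads from global image features and carries object queries across frames for data association. The paper's code is not shown here, so the following is a minimal PyTorch sketch of that general idea under a TrackFormer/TransTrack-style assumption; the class name QueryTracker, the toy backbone, the dimensions, and the confidence threshold are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of query-based joint detection and tracking (assumed design,
# not the released PHDTT code): object queries attend to global image features
# for per-frame detection, and confident query embeddings are carried forward
# as track queries, so data association is implicit in the decoder.
import torch
import torch.nn as nn


class QueryTracker(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=1):
        super().__init__()
        # Toy patchify "backbone"; a real system would use a stronger backbone.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Learned object queries detect newly appearing heads.
        self.object_queries = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h)

    def forward(self, frame, track_queries=None):
        # frame: (B, 3, H, W); track_queries: (B, T, d_model) carried over from
        # the previous frame to keep identities without IoU or Re-ID matching.
        feats = self.backbone(frame).flatten(2).transpose(1, 2)   # (B, HW, d)
        memory = self.encoder(feats)                              # global image features
        obj = self.object_queries.weight.unsqueeze(0).expand(frame.size(0), -1, -1)
        queries = obj if track_queries is None else torch.cat([track_queries, obj], dim=1)
        hs = self.decoder(queries, memory)        # each query attends to the whole frame
        return self.class_head(hs), self.box_head(hs).sigmoid(), hs


# Usage sketch: detections from frame t become track queries for frame t+1.
model = QueryTracker()
frame_t, frame_t1 = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
logits, boxes, embed = model(frame_t)
keep = logits.softmax(-1)[..., 0] > 0.5           # confident head detections (assumed threshold)
track_q = embed[keep].unsqueeze(0)                # carry their embeddings forward
logits_t1, boxes_t1, _ = model(frame_t1, track_q)
```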