Adaptive Vision Transformer for Event-Based Human Pose Estimation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Human pose estimation has made substantial progress with deep learning. However, it still struggles in challenging conditions such as overexposure, low light, and high-speed motion, where motion blur and lost human contours degrade accuracy. Moreover, because large-scale convolutional neural network (CNN) inference requires extensive computation, marker-free human pose estimation with standard frame-based cameras remains too slow and power-hungry for real-time feedback interaction. Event-based cameras output asynchronous, sparse moving-edge information with low latency and low power consumption, making them well suited to real-time interaction with human pose estimators. To support further study, this paper presents a high-frame-rate labeled event-based human pose estimation dataset named Event Multi Movement HPE (EventMM HPE). It consists of recordings from a synchronized event camera, high-frame-rate camera, and Vicon motion capture system, with each sequence capturing multiple action combinations and high-frame-rate (240 Hz) annotations. This paper also presents an event-based human pose estimation model that uses adaptive patches to efficiently achieve good performance on the sparse, reduced input data from a dynamic vision sensor (DVS). The source code, dataset, and pre-trained models will be released upon acceptance.
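As context for the adaptive-patch idea sketched in the abstract, the following minimal PyTorch sketch shows one plausible way to tokenize sparse DVS output: accumulate events into a count frame, then keep only the most active patches as Transformer tokens. The function names, patch size, and top-k rule here are illustrative assumptions, not the released EventMM HPE code.

    import torch

    def events_to_frame(events, height, width):
        # Accumulate (x, y, t, polarity) events into a 2-channel count
        # frame, one channel per polarity.
        frame = torch.zeros(2, height, width)
        for x, y, _, p in events:
            frame[int(p), int(y), int(x)] += 1.0
        return frame

    def select_active_patches(frame, patch_size=16, k=64):
        # Split the frame into non-overlapping patches and keep the k
        # patches with the highest event counts; empty regions of the
        # sparse event frame are never tokenized.
        c, h, w = frame.shape
        patches = (frame.unfold(1, patch_size, patch_size)
                        .unfold(2, patch_size, patch_size)
                        .reshape(c, -1, patch_size, patch_size)
                        .permute(1, 0, 2, 3))
        activity = patches.sum(dim=(1, 2, 3))
        top_idx = activity.topk(min(k, activity.numel())).indices
        return patches[top_idx], top_idx  # patch tokens and their positions

Because event data concentrates on moving edges, a budget of k tokens typically covers the person while discarding static background, which is where the efficiency gain over dense ViT patching would come from.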
Relevance To Conference: Our work introduces a dynamic vision sensor to address the challenge of human pose estimation in scenarios with high dynamic range and fast-moving subjects. The innovation of this method lies in an Adaptive Vision Transformer, whose adaptive sampling module and adaptive dropout module handle asynchronous, discrete event data. We validate the effectiveness of these two modules on both our custom dataset and the public DHP19 dataset, providing a solution for real-time human pose interaction in real-world settings.
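To make the adaptive dropout module concrete, here is a hypothetical sketch of activity-weighted token dropout: tokens carrying little event evidence are dropped with higher probability than busy ones. The class name, scaling rule, and base rate are assumptions for illustration, not the paper's definition of the module.

    import torch
    import torch.nn as nn

    class AdaptiveTokenDropout(nn.Module):
        # Hypothetical: drop ViT tokens with probability scaled by how
        # little event activity they carry, rather than uniformly.
        def __init__(self, base_p=0.1):
            super().__init__()
            self.base_p = base_p

        def forward(self, tokens, activity):
            # tokens:   (batch, n_tokens, dim) patch embeddings
            # activity: (batch, n_tokens) event counts per token
            if not self.training:
                return tokens
            low = 1.0 - activity / (activity.amax(dim=1, keepdim=True) + 1e-6)
            drop_p = (self.base_p * (1.0 + low)).clamp(max=0.9)
            keep = (torch.rand_like(drop_p) >= drop_p).to(tokens.dtype)
            return tokens * keep.unsqueeze(-1)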
Primary Subject Area: [Experience] Interactions and Quality of Experience
Secondary Subject Area: [Experience] Multimedia Applications
Submission Number: 3770