Abstract: 3D human pose estimation (3D HPE) is an important computer vision task with various practical applications. However, multi-person 3D pose estimation from
a monocular video (3DMPPE) is particularly challenging. Recent transformer-based approaches focus on capturing spatial-temporal information from sequential 2D
poses, which unfortunately discards the visual features relevant to 3D pose estimation. In this paper, we propose
an end-to-end framework called Event Guided Video Transformer (EVT), which predicts 3D poses directly from video
frames by effectively learning spatial-temporal contextual information
from visual features. In addition, our design is
the first to incorporate event features to help guide 3D
pose estimation. EVT first decouples persons from video frames into separate instance-aware feature maps. These
features, which carry specific cues about body structure, are then fed together with event features into an attention-based Event-Aware Embedding Module. The fused
features for each instance are then fed into an intra-human
relation extraction module and subsequently into a temporal
transformer to extract inter-frame relationships. Finally, the
extracted features are fed into a decoder for 3D pose estimation. Experiments on three widely used 3D pose estimation benchmarks show that our proposed EVT achieves
better performance than state-of-the-art models.
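To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of the stages named in the abstract (event-aware fusion, intra-human relation extraction, temporal transformer, 3D decoder). The module names, tensor shapes, and hyperparameters are illustrative assumptions and do not reflect the authors' actual implementation.

```python
# Hypothetical sketch of the EVT pipeline described in the abstract.
# All names, shapes, and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn


class EventAwareEmbedding(nn.Module):
    """Fuse per-instance visual tokens with event tokens via cross-attention (assumed design)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, inst_feats, event_feats):
        # inst_feats:  (B*T, N_tokens, dim) instance-aware visual tokens
        # event_feats: (B*T, N_event, dim)  event-derived tokens
        fused, _ = self.attn(query=inst_feats, key=event_feats, value=event_feats)
        return self.norm(inst_feats + fused)


class EVT(nn.Module):
    """Sketch: fusion -> intra-human (spatial) transformer -> temporal transformer -> 3D decoder."""
    def __init__(self, dim: int = 256, num_joints: int = 17, depth: int = 2):
        super().__init__()
        self.fusion = EventAwareEmbedding(dim)
        spatial_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.intra_human = nn.TransformerEncoder(spatial_layer, depth)   # relations within one person, one frame
        self.temporal = nn.TransformerEncoder(temporal_layer, depth)     # relations across frames
        self.decoder = nn.Linear(dim, num_joints * 3)                    # regress 3D joints per frame

    def forward(self, inst_feats, event_feats):
        # inst_feats:  (B, T, N_tokens, dim) one instance's tokens per frame
        # event_feats: (B, T, N_event, dim)
        B, T, N, D = inst_feats.shape
        x = self.fusion(inst_feats.flatten(0, 1), event_feats.flatten(0, 1))
        x = self.intra_human(x)                   # (B*T, N, D) intra-frame, intra-human relations
        x = x.mean(dim=1).view(B, T, D)           # pool tokens, then model inter-frame relations
        x = self.temporal(x)                      # (B, T, D)
        return self.decoder(x).view(B, T, -1, 3)  # (B, T, num_joints, 3) 3D poses


if __name__ == "__main__":
    model = EVT()
    vis = torch.randn(2, 8, 17, 256)   # 2 clips, 8 frames, 17 tokens per instance (assumed)
    evt = torch.randn(2, 8, 32, 256)   # 32 event tokens per frame (assumed)
    print(model(vis, evt).shape)       # torch.Size([2, 8, 17, 3])
```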