Abstract: 3D human pose estimation (3D HPE) is an important computer vision task with various practical applications. However, multi-person 3D pose estimation from
a monocular video (3DMPPE) is particularly challenging. Recent transformer-based approaches focus on capturing spatial-temporal information from sequential 2D
poses, which unfortunately discards the visual features relevant to 3D pose estimation. In this paper, we propose
an end-to-end framework called Event Guided Video Transformer (EVT), which predicts 3D poses directly from video
frames by effectively learning spatial-temporal contextual information
from visual features. In addition, our design is
the first to incorporate event features to help guide 3D
pose estimation. EVT first decouples persons from video frames into separate instance-aware feature maps. These
features, which carry specific cues about body structure, are then fed together with event features into an attention-based Event-Aware Embedding Module. The fused
features for each instance are then fed into an intra-human
relation extraction module and subsequently into a temporal
transformer to extract inter-frame relationships. Finally, the
extracted features are fed into a decoder for 3D pose estimation. Experiments on three widely used 3D pose estimation benchmarks show that our proposed EVT achieves
better performance than state-of-the-art models.
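To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of the stages named in the abstract (event-aware fusion, intra-human relation extraction, temporal transformer, 3D decoder). The module names, tensor shapes, and hyperparameters are illustrative assumptions and do not reflect the authors' actual implementation.

```python
# Hypothetical sketch of the EVT pipeline described in the abstract.
# All names, shapes, and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn


class EventAwareEmbedding(nn.Module):
    """Fuse per-instance visual tokens with event tokens via cross-attention (assumed design)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, inst_feats, event_feats):
        # inst_feats:  (B*T, N_tokens, dim) instance-aware visual tokens
        # event_feats: (B*T, N_event, dim)  event-derived tokens
        fused, _ = self.attn(query=inst_feats, key=event_feats, value=event_feats)
        return self.norm(inst_feats + fused)


class EVT(nn.Module):
    """Sketch: fusion -> intra-human (spatial) transformer -> temporal transformer -> 3D decoder."""
    def __init__(self, dim: int = 256, num_joints: int = 17, depth: int = 2):
        super().__init__()
        self.fusion = EventAwareEmbedding(dim)
        spatial_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.intra_human = nn.TransformerEncoder(spatial_layer, depth)   # relations within one person, one frame
        self.temporal = nn.TransformerEncoder(temporal_layer, depth)     # relations across frames
        self.decoder = nn.Linear(dim, num_joints * 3)                    # regress 3D joints per frame

    def forward(self, inst_feats, event_feats):
        # inst_feats:  (B, T, N_tokens, dim) one instance's tokens per frame
        # event_feats: (B, T, N_event, dim)
        B, T, N, D = inst_feats.shape
        x = self.fusion(inst_feats.flatten(0, 1), event_feats.flatten(0, 1))
        x = self.intra_human(x)                   # (B*T, N, D) intra-frame, intra-human relations
        x = x.mean(dim=1).view(B, T, D)           # pool tokens, then model inter-frame relations
        x = self.temporal(x)                      # (B, T, D)
        return self.decoder(x).view(B, T, -1, 3)  # (B, T, num_joints, 3) 3D poses


if __name__ == "__main__":
    model = EVT()
    vis = torch.randn(2, 8, 17, 256)   # 2 clips, 8 frames, 17 tokens per instance (assumed)
    evt = torch.randn(2, 8, 32, 256)   # 32 event tokens per frame (assumed)
    print(model(vis, evt).shape)       # torch.Size([2, 8, 17, 3])
```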