UNSPAT: Uncertainty-Guided SpatioTemporal Transformer for 3D Human Pose and Shape Estimation on Videos

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · WACV 2024 · CC BY-SA 4.0
Abstract: We propose an efficient framework for 3D human pose and shape estimation from video, named the Uncertainty-Guided SpatioTemporal Transformer (UNSPAT). Unlike previous video-based methods that model temporal relationships over globally average-pooled features, our approach incorporates both spatial and temporal dimensions without discarding spatial information. We address the excessive complexity of full spatiotemporal attention with two modules, the Spatial Alignment Module (SAM) and Space2Batch, which align the input features and compute temporal attention at every spatial position in a batch-wise manner. Furthermore, an uncertainty-guided attention re-weighting module improves performance by diminishing the impact of artifacts. We demonstrate the effectiveness of UNSPAT on widely used benchmark datasets, where it achieves state-of-the-art performance. Our method is robust to challenging scenes, such as occlusion and cluttered backgrounds, showing its potential for real-world applications.
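To make the Space2Batch idea concrete, below is a minimal sketch of how temporal attention can be computed independently at every spatial position by folding the spatial axis into the batch axis. This is an illustration under assumed conventions, not the authors' implementation: the class name, the tensor layout (B, T, S, C), and the sigmoid-based uncertainty re-weighting are all hypothetical stand-ins for the modules named in the abstract.

```python
# Hypothetical sketch of the Space2Batch idea: fold the S = H*W spatial
# tokens into the batch axis so temporal attention runs per spatial
# position. All names and shapes here are illustrative assumptions.
from typing import Optional

import torch
import torch.nn as nn


class Space2BatchTemporalAttention(nn.Module):
    """Temporal self-attention applied at every spatial position.

    Folding spatial tokens into the batch keeps attention cost at
    O(T^2) per position, instead of O((S*T)^2) for full joint
    spatiotemporal attention over all tokens.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(
        self,
        x: torch.Tensor,                      # (B, T, S, C) video features
        uncertainty: Optional[torch.Tensor] = None,  # (B, T, S) per-token scores
    ) -> torch.Tensor:
        B, T, S, C = x.shape
        # Space2Batch: (B, T, S, C) -> (B*S, T, C)
        x = x.permute(0, 2, 1, 3).reshape(B * S, T, C)
        out, _ = self.attn(x, x, x)
        if uncertainty is not None:
            # One plausible form of uncertainty-guided re-weighting
            # (assumption): down-weight tokens with high predicted
            # uncertainty so artifacts contribute less downstream.
            w = torch.sigmoid(-uncertainty).permute(0, 2, 1).reshape(B * S, T, 1)
            out = out * w
        # Batch2Space: (B*S, T, C) -> (B, T, S, C)
        return out.reshape(B, S, T, C).permute(0, 2, 1, 3)


# Usage: 8-frame clip, 7x7 feature map flattened to 49 tokens, 256 channels.
feats = torch.randn(2, 8, 49, 256)
module = Space2BatchTemporalAttention(dim=256)
print(module(feats).shape)  # torch.Size([2, 8, 49, 256])
```

The design point this sketch captures is that the reshape is free of learned parameters: the quadratic attention cost is paid only along the temporal axis, while spatial structure is preserved by carrying every position through the batch rather than average-pooling it away.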