Domain-Guided Spatio-Temporal Self-Attention for Egocentric 3D Pose Estimation

Published: 01 Jan 2023, Last Modified: 28 Feb 2024 · KDD 2023
Abstract: Vision-based egocentric 3D human pose estimation (ego-HPE) is essential to support critical applications of xR technologies. However, severe self-occlusions and the strong distortion introduced by the fish-eye view of the head-mounted camera make ego-HPE extremely challenging. To address these challenges, we propose a domain-guided spatio-temporal transformer model that leverages information specific to ego-views. Powered by this domain-guided transformer, we build the Egocentric Spatio-Temporal Self-Attention Network (Ego-STAN), which uses 2D image representations and spatio-temporal attention to address both distortions and self-occlusions in ego-HPE. Additionally, we introduce a spatial concept called feature map tokens (FMT), which endows Ego-STAN with the ability to draw on the complex spatio-temporal information encoded in egocentric videos. Our quantitative evaluation on the contemporary xR-EgoPose dataset achieves a 38.2% improvement on the highest-error joints over the SOTA ego-HPE model, while reducing the number of parameters by 22%. Finally, we demonstrate the generalization capabilities of our model to real-world HPE tasks beyond ego-views, achieving a 7.7% improvement on 2D human pose estimation with the Human3.6M dataset. Our code is available at: https://github.com/jmpark0808/Ego-STAN
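The abstract describes learned feature map tokens (FMT) that attend over spatio-temporal features from a sequence of egocentric frames. Below is a minimal PyTorch sketch of that idea; the module name, tensor shapes, and the learned FMT query parameter are illustrative assumptions rather than the authors' exact implementation (see the linked repository for that).

```python
# Hypothetical sketch: learned feature-map tokens (FMT) querying
# spatio-temporal features via multi-head self-attention. Shapes and
# hyperparameters below are assumed for illustration only.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, dim=256, heads=8, num_tokens=49, seq_len=5):
        super().__init__()
        # One learned FMT per spatial location; the tokens query the full
        # spatio-temporal token set to summarize the frame sequence.
        self.fmt = nn.Parameter(torch.randn(1, num_tokens, dim))
        # Learned positional embedding over all (frame, location) tokens.
        self.pos = nn.Parameter(torch.randn(1, seq_len * num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (batch, seq_len, num_tokens, dim) -- per-frame CNN feature
        # maps flattened into spatial tokens.
        b, t, n, d = feats.shape
        tokens = feats.reshape(b, t * n, d) + self.pos
        query = self.fmt.expand(b, -1, -1)
        # FMT queries attend over all frames and spatial positions at once,
        # aggregating temporal context to help resolve self-occlusions.
        out, _ = self.attn(query, tokens, tokens)
        return self.norm(out)  # (batch, num_tokens, dim), e.g. for a 2D head

if __name__ == "__main__":
    x = torch.randn(2, 5, 49, 256)  # 2 clips, 5 frames, 7x7 spatial tokens
    y = SpatioTemporalAttention()(x)
    print(y.shape)                  # torch.Size([2, 49, 256])
```

The design choice sketched here is that attention output keeps a spatial-map shape, so it can feed a standard 2D heatmap head before 3D lifting, consistent with the abstract's emphasis on 2D image representations.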