Recent works in 2D-to-3D pose uplifting for monocular 3D Human Pose Estimation (HPE) have shown significant progress. However, two key challenges persist in real-world applications: vulnerability to joint noise and high computational costs. These issues arise from the dense joint-frame connections and iterative correlations typically employed by mainstream GNN-based and Transformer-based methods. To address these challenges, we propose a novel approach that leverages human physical structure and long-range dynamics to learn spatial part-based and temporal frameset-based representations. This method is inherently robust to missing or erroneous joints while also reducing model parameters. Specifically, in the Spatial Encoding stage, coarse-grained body parts are used to construct structural correlations with a fully adaptive graph topology. This spatial correlation representation is integrated with multi-granularity pose attributes to generate a comprehensive pose representation for each frame. In the Temporal Encoding and Decoding stages, Skipped Self-Attention is performed within framesets to establish long-term temporal dependencies from multiple perspectives of movement. On this basis, we propose a compact Graph and Skipped Transformer (G-SFormer), which realises efficient and robust 3D HPE in both experimental and practical scenarios. Extensive experiments on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks demonstrate that G-SFormer series models match or outperform state-of-the-art methods while using only a fraction of the parameters and around 1% of the computational cost. G-SFormer also exhibits outstanding robustness to inaccurately detected 2D poses. The source code will be available at https://sites.google.com/view/g-sformer.
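The frameset-based Skipped Self-Attention mentioned above can be sketched as follows. This is a minimal illustration, assuming the framesets are formed by strided subsampling (every `stride`-th frame) and using a single-head scaled dot-product attention; the paper's actual grouping and attention design may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def skipped_self_attention(frames, stride):
    """Partition a (T, d) sequence of per-frame pose embeddings into `stride`
    interleaved framesets (frames t, t+stride, t+2*stride, ...) and run plain
    scaled dot-product self-attention within each frameset only.

    Attention is never computed across framesets, so the cost drops from
    O(T^2) to roughly O(T^2 / stride) while each frameset still spans the
    whole sequence, capturing long-range dynamics.
    """
    T, d = frames.shape
    out = np.empty_like(frames)
    for offset in range(stride):
        idx = np.arange(offset, T, stride)   # indices of one frameset
        x = frames[idx]                      # (T // stride (±1), d)
        scores = x @ x.T / np.sqrt(d)        # attention only within the frameset
        out[idx] = softmax(scores, axis=-1) @ x
    return out
```

With `stride = 1` this degenerates to ordinary full self-attention over all frames; larger strides trade dense temporal mixing for sparser, longer-range correlations at lower cost.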