Keywords: monocular pose estimation, data augmentation
Abstract: While recent two-stage many-to-one deep learning models have demonstrated great success in 3D human pose estimation, such models are inefficient at 3D keypoint detection and tend to propagate first-stage errors onto the second stage. In this paper, we introduce SoloPose, a novel one-stage, many-to-many spatio-temporal transformer model for kinematic 3D human pose estimation from video. SoloPose is further fortified by HeatPose, a 3D heatmap based on Gaussian Mixture Model distributions that factors in target keypoints as well as kinematically adjacent keypoints. Finally, we address data diversity constraints with the 3D AugMotion Toolkit, a methodology to augment existing 3D human pose datasets, specifically by projecting four leading public 3D human pose datasets (Human3.6M, MADS, AIST Dance++, MPI INF 3DHP) into a novel dataset (Human7.1M) with a universal coordinate system. Extensive experiments are conducted on both Human3.6M and the augmented Human7.1M dataset, and SoloPose demonstrates results superior to state-of-the-art approaches.
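The abstract describes HeatPose as a 3D heatmap built from Gaussian Mixture Model distributions that covers the target keypoint along with kinematically adjacent keypoints. A minimal sketch of that idea is below; the function name, grid resolution, sigma, and the 0.3 adjacent-keypoint weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gmm_heatmap(grid_size, target, adjacent, sigma=2.0, adj_weight=0.3):
    """Sketch of a GMM-style 3D heatmap: a full-weight Gaussian at the
    target keypoint plus down-weighted Gaussians at kinematically
    adjacent keypoints (weights here are assumed, not from the paper)."""
    zz, yy, xx = np.meshgrid(
        np.arange(grid_size), np.arange(grid_size), np.arange(grid_size),
        indexing="ij",
    )

    def gaussian(center):
        # Isotropic 3D Gaussian centered at an (x, y, z) voxel coordinate.
        d2 = (xx - center[0]) ** 2 + (yy - center[1]) ** 2 + (zz - center[2]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))

    heat = gaussian(target)
    for center in adjacent:
        heat += adj_weight * gaussian(center)
    return heat / heat.max()  # normalize so the target peak is 1.0

# Example: an elbow keypoint with shoulder and wrist as kinematic neighbors.
h = gmm_heatmap(16, target=(8, 8, 8), adjacent=[(4, 8, 8), (12, 8, 8)])
peak = np.unravel_index(np.argmax(h), h.shape)  # peak stays at the target
```

Because the adjacent Gaussians are down-weighted, the heatmap's global maximum remains at the target keypoint while still encoding skeletal context around it.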
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2043