Abstract: Monocular 3D human pose estimation involves predicting the 3D coordinates of key body joints from a 2D image or video. Typically, a 2D estimation model is first used to localize the joints in the image, and a separate lifting model is then trained to map these 2D positions to 3D coordinates. In this paper, we evaluate recently proposed 2D human pose estimation models as different inputs for training and evaluating 2D-to-3D lifting models. In addition, we propose four simple merging strategies that combine the outputs of these 2D human pose estimators to generate less noisy 2D inputs. For evaluation, four recent 2D pose estimators (ViTPose, PCT, MogaNet, and TransPose) are selected, and their 2D outputs are generated on the Human3.6M dataset. MotionAGFormer and PoseFormerV2 are then trained and evaluated on each created 2D input together with its corresponding 3D motion-capture ground truth. ViTPose stands out as the top-performing 2D estimator, and all of the merging strategies prove beneficial for generating a less noisy 2D input. Code and data are available at https://github.com/TaatiTeam/2DEstimatorEval.
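The abstract does not describe the four merging strategies themselves; as an illustration only, the sketch below shows one plausible way to combine per-joint predictions from several 2D estimators, namely confidence-weighted averaging. The function name, array shapes, and the (x, y, confidence) keypoint layout are assumptions for this sketch, not the paper's actual implementation.

```python
import numpy as np


def merge_keypoints(predictions: list[np.ndarray]) -> np.ndarray:
    """Merge per-frame 2D keypoints from several estimators.

    Each element of `predictions` is assumed to have shape (J, 3):
    (x, y, confidence) for J joints, all in the same joint ordering.
    Returns a (J, 3) array whose coordinates are the confidence-weighted
    mean and whose confidence is the per-joint maximum.
    """
    stacked = np.stack(predictions)          # (K, J, 3) for K estimators
    xy = stacked[..., :2]                    # (K, J, 2) joint coordinates
    conf = stacked[..., 2:3]                 # (K, J, 1) confidence scores
    weights = conf / np.clip(conf.sum(axis=0, keepdims=True), 1e-8, None)
    merged_xy = (xy * weights).sum(axis=0)   # confidence-weighted mean
    merged_conf = conf.max(axis=0)           # keep the most confident score
    return np.concatenate([merged_xy, merged_conf], axis=-1)


if __name__ == "__main__":
    # Hypothetical outputs of two estimators for a 17-joint skeleton.
    rng = np.random.default_rng(0)
    estimator_a = rng.random((17, 3))
    estimator_b = rng.random((17, 3))
    print(merge_keypoints([estimator_a, estimator_b]).shape)  # (17, 3)
```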