Abstract: We propose a technique for 3D human skeleton estimation from an RGB image sequence. Our method uses two stages of deep-learning networks. The first stage estimates enhanced 2D skeletons (2D image coordinates and relative depths for all joints). The sequence of enhanced 2D skeletons is represented as a spatial-temporal graph (STG), which is then fed to the second stage, built on a Graph Convolutional Network (GCN) backbone. Techniques of high-order joint feature representation, multi-stream feature adjustment, and denoising were developed to further improve the accuracy of the estimated 3D skeletons. The Human3.6M dataset was used for training and testing. Experimental results show that our multi-stream GCN-based network can extract useful information from the input sequence efficiently. In the experiments, the mean per joint position error (MPJPE) of the 3D skeletal joints is 47.27 mm when a sequence of 31 RGB frames is considered.
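To make the spatial-temporal graph idea concrete, the following is a minimal sketch (not the authors' code) of how a sequence of enhanced 2D skeletons could be assembled into an STG and passed through one graph-convolution layer. The joint count, the edge list, the layer sizes, and the helper names `build_st_adjacency` and `GraphConv` are illustrative assumptions; the paper's network additionally uses high-order joint features, multi-stream feature adjustment, and denoising, which are omitted here.

```python
import torch
import torch.nn as nn

NUM_JOINTS = 17          # Human3.6M-style skeleton (assumption)
SEQ_LEN = 31             # number of RGB frames in the input sequence
IN_FEATS = 3             # (x, y, relative depth) per joint from stage one

# Spatial edges within one frame (parent-child joint pairs; subset for brevity).
SPATIAL_EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

def build_st_adjacency(num_joints, seq_len, spatial_edges):
    """Adjacency of the spatial-temporal graph: spatial bones inside each
    frame plus temporal links between the same joint in consecutive frames."""
    n = num_joints * seq_len
    adj = torch.eye(n)                               # self-loops
    for t in range(seq_len):
        off = t * num_joints
        for i, j in spatial_edges:                   # spatial edges
            adj[off + i, off + j] = adj[off + j, off + i] = 1.0
        if t + 1 < seq_len:                          # temporal edges
            nxt = (t + 1) * num_joints
            for j in range(num_joints):
                adj[off + j, nxt + j] = adj[nxt + j, off + j] = 1.0
    deg = adj.sum(dim=1, keepdim=True)
    return adj / deg                                 # row-normalise

class GraphConv(nn.Module):
    """One GCN layer: aggregate neighbour features, then a linear transform."""
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, x, adj):                       # x: (batch, nodes, feats)
        return torch.relu(self.linear(adj @ x))

adj = build_st_adjacency(NUM_JOINTS, SEQ_LEN, SPATIAL_EDGES)
x = torch.randn(8, NUM_JOINTS * SEQ_LEN, IN_FEATS)   # batch of stage-one outputs
out = GraphConv(IN_FEATS, 64)(x, adj)
print(out.shape)                                     # torch.Size([8, 527, 64])
```

In this sketch the final regression to 3D joint coordinates and the multi-stream branches are left out; the point is only that spatial (bone) and temporal (same joint across frames) connections are encoded in a single normalized adjacency that the GCN layers share.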