Pseudo View Representation Learning for Monocular RGB-D Human Pose and Shape Estimation

Published: 10 Dec 2021 · Last Modified: 02 Apr 2025 · IEEE Signal Processing Letters · CC BY 4.0
Abstract: This work studies the problem of estimating human pose and shape from monocular RGB-D images. Depth information in the RGB-D input enables accurate 3D human reconstruction. However, the limited size of existing RGB-D datasets restricts the generalization ability of current RGB-D based methods. In this letter, we propose a novel architecture, View Render Net (VRNet), that exploits the underlying 3D structure in the RGB-D input and leverages additional RGB datasets for training. VRNet synthesizes pseudo multi-view representations via a novel feature rendering approach. The synthetic multi-view feature maps are aggregated with Multi-stage Multi-view Fusion (MMF) to impose 3D structure constraints. We show that VRNet naturally supports mixed RGB/RGB-D training and inference, which improves the model's generalization ability. Comprehensive experiments demonstrate the effectiveness of VRNet. As an illustrative example, the proposed VRNet improves MPJPE and PA-MPJPE by 15.2 mm and 4.9 mm, respectively, on the Human3.6M dataset.
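The abstract does not include the paper's implementation details, but the core idea of synthesizing pseudo views from a single RGB-D frame can be sketched in a minimal form: back-project per-pixel features to 3D using the depth map and camera intrinsics, reproject them into virtual cameras rotated around the subject, and fuse the resulting feature maps. Everything below is a hypothetical simplification: the function names, yaw angles, nearest-pixel splatting, and the averaging fusion are assumptions, not VRNet's actual feature render or MMF modules.

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) with depth d to the 3D point d * K^-1 [u, v, 1]^T."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    return (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)             # (3, H*W)

def render_pseudo_view(feat, depth, K, yaw_deg):
    """Reproject per-pixel features into a virtual camera rotated by yaw_deg
    about the vertical axis, producing one pseudo-view feature map.
    (Toy stand-in for the paper's feature rendering; uses nearest-pixel splatting.)"""
    H, W, C = feat.shape
    pts = backproject(depth, K)
    a = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    proj = K @ (R @ pts)
    z = np.clip(proj[2], 1e-6, None)
    uu = np.round(proj[0] / z).astype(int)
    vv = np.round(proj[1] / z).astype(int)
    out = np.zeros_like(feat)
    valid = (uu >= 0) & (uu < W) & (vv >= 0) & (vv < H)
    out[vv[valid], uu[valid]] = feat.reshape(-1, C)[valid]
    return out

def fuse_views(views):
    """Toy multi-view fusion: average the pseudo-view feature maps
    (VRNet's MMF performs multi-stage learned fusion instead)."""
    return np.mean(np.stack(views), axis=0)

# Usage with synthetic data (intrinsics, depth, and feature channels are made up):
K = np.array([[500.0, 0.0, 32.0], [0.0, 500.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)          # flat 2 m depth plane
feat = np.random.rand(64, 64, 8)        # per-pixel feature map
views = [render_pseudo_view(feat, depth, K, a) for a in (-15.0, 0.0, 15.0)]
fused = fuse_views(views)               # (64, 64, 8) fused representation
```

A zero-degree rotation reproduces the input feature map exactly, which is a quick sanity check that the back-projection and reprojection are consistent; the real method would also handle occlusion and learn the fusion weights.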