Reconstructing 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing methods often produce coarse body poses and exhibit misalignments and unnatural artifacts in fine-grained regions such as the face and hands; these errors can accumulate and lead to significant errors in downstream tasks. To address this issue, we propose PEAR, a unified framework for human mesh recovery and rendering. PEAR explicitly tackles two major limitations of current methods: inaccurate localization of fine-grained human pose details and insufficient photometric supervision for self-reconstruction. Specifically, we train a Transformer-based model that recovers expressive 3D human geometry (SMPL-X + FLAME) from a single image without cropping specific body parts. This preprocessing-free design enables real-time inference at over 100 FPS. Furthermore, we integrate the model with a neural renderer to jointly optimize geometry and appearance, which significantly improves the reconstruction accuracy of fine-grained human geometry and yields higher-quality rendering results. Lastly, we curate a large-scale dataset of images and videos with human pose and keypoint annotations to facilitate model training. Extensive experiments on multiple benchmark datasets demonstrate that the proposed approach achieves significant improvements in both geometric reconstruction accuracy and rendering quality.
Our approach attains highly detailed facial alignment, enabling the capture of more nuanced expressions.
Compared methods: OSX, SMPLest, Multi-HMR (failed), Ours.
Our method achieves more accurate alignment with actual motion in both the face and hands.
Compared methods: OSX, SMPLest, Multi-HMR, Ours.
Our method achieves finer pixel-level alignment over the entire human body throughout the motion, without the large offsets seen in other approaches.
Compared methods: OSX, SMPLest, Multi-HMR, Ours.
Because PEAR runs inference at over 100 FPS, the system can serve as a real-time animation interface: it estimates SMPL-X and FLAME parameters from video streams and drives animations at 50 FPS.
Real-time Animation
Drive a wider variety of identities
Cartoon Animation
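To put the real-time figures quoted above in perspective, the short sketch below converts frame rates into per-frame time budgets. It is only illustrative arithmetic: the function name `frame_budget_ms` is hypothetical, and reading the difference as headroom for rendering assumes that estimation and animation run one after the other for each frame, which the source does not state.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Figures quoted in the text: inference at (over) 100 FPS, animation at 50 FPS.
inference_ms = frame_budget_ms(100.0)  # at most 10 ms per frame for parameter estimation
animation_ms = frame_budget_ms(50.0)   # 20 ms per animated frame end to end

# Assumption (not from the source): stages run sequentially per frame,
# so the remainder is what is left for driving and rendering the avatar.
headroom_ms = animation_ms - inference_ms

print(f"inference: {inference_ms:.1f} ms, animation: {animation_ms:.1f} ms, "
      f"headroom: {headroom_ms:.1f} ms")
```

Under that assumption, parameter estimation would consume no more than about half of each 20 ms animation frame.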
We showcase several extreme cases, including motion blur, occlusion, strong illumination, loose clothing, and long hair.
Loose clothing and hair