PEAR: Pixel-aligned Expressive humAn mesh Recovery


Method overview


Abstract

Reconstructing 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing methods often produce coarse body poses and exhibit misalignments and unnatural artifacts in fine-grained regions such as the face and hands; these errors can accumulate and cause significant errors in downstream tasks. To address this issue, we propose PEAR, a unified framework for human mesh recovery and rendering. PEAR explicitly tackles two major limitations of current methods: inaccurate localization of fine-grained human pose details and insufficient photometric supervision for self-reconstruction. Specifically, we train a Transformer-based model that recovers expressive 3D human geometry (SMPL-X + FLAME) from a single image without cropping specific body parts. This preprocessing-free design enables real-time inference at over 100 FPS. Furthermore, we integrate the model with a neural renderer to jointly optimize geometry and appearance, which significantly improves the reconstruction accuracy of fine-grained human geometry and yields higher-quality rendering results. Lastly, we curate a large-scale dataset of images and videos with human pose and keypoint annotations to facilitate model training. Extensive experiments on multiple benchmark datasets demonstrate that the proposed approach achieves significant improvements in both geometric reconstruction accuracy and rendering quality.

Head mesh recovery

Our approach attains highly detailed facial alignment, enabling the capture of more nuanced expressions.

OSX

SMPLest

Multi-HMR

Multi-HMR (failed)

Ours

Ubody mesh recovery

Our method achieves more accurate alignment with actual motion in both the face and hands.

OSX

SMPLest

Multi-HMR

Ours

WholeBody mesh recovery

Our method achieves finer pixel-level alignment across the whole body, avoiding the large offsets seen in other approaches.

OSX

SMPLest

Multi-HMR

Ours
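One way to make "pixel-level alignment" concrete is to measure the mean 2D offset between keypoints projected from a recovered mesh and the annotated keypoints in the image. The sketch below assumes that reading; it is not the evaluation protocol used for PEAR, and the function name `mean_pixel_offset` and the toy coordinates are hypothetical.

```python
import math


def mean_pixel_offset(projected, annotated):
    """Average Euclidean distance, in pixels, between matching 2D keypoints."""
    if len(projected) != len(annotated) or not projected:
        raise ValueError("need two non-empty keypoint lists of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(projected, annotated):
        total += math.hypot(px - ax, py - ay)
    return total / len(projected)


# Toy data (made-up coordinates, for illustration only).
annotated = [(100.0, 200.0), (150.0, 220.0)]
projected = [(103.0, 204.0), (150.0, 220.0)]
offset = mean_pixel_offset(projected, annotated)
print(f"mean offset: {offset:.2f} px")
```

A lower value means the projected mesh sits closer to the image evidence; a large value corresponds to the visible offsets in the comparisons above.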

Downstream application

Benefiting from PEAR’s fast inference speed (100 FPS), the system functions as a real-time animation interface, estimating SMPL-X and FLAME parameters from video streams and driving animations at 50 FPS.

Real-time Animation

Drive a wider variety of identities

Cartoon Animation
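The frame rates quoted for the real-time animation interface translate into per-frame time budgets. The sketch below works through that arithmetic. It assumes that parameter estimation and animation run back to back on each frame, which this page does not state; `frame_budget_ms` is a hypothetical helper.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


INFERENCE_FPS = 100.0  # PEAR inference speed quoted above
ANIMATION_FPS = 50.0   # end-to-end animation rate quoted above

inference_ms = frame_budget_ms(INFERENCE_FPS)
animation_ms = frame_budget_ms(ANIMATION_FPS)

# Assumption: the two stages run sequentially, so whatever the estimator
# does not use is left for driving and rendering the animation.
remaining_ms = animation_ms - inference_ms

print(f"inference: {inference_ms:.1f} ms per frame")
print(f"animation: {animation_ms:.1f} ms per frame")
print(f"left for driving and rendering: {remaining_ms:.1f} ms")
```

Under this assumption, estimation uses about half of each animation frame's budget. If the stages were pipelined instead, the split would differ.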

Some extreme cases

We showcase several extreme cases, including motion blur, occlusion, strong illumination, loose clothing, and long hair.

Loose clothing and hair
