ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

Wenhao Shen; Wanqi Yin; Xiaofeng Yang; Cheng Chen; Chaoyue Song; Zhongang Cai; Lei Yang; Hao Wang; Guosheng Lin

Published: 01 May 2025, Last Modified: 23 Jul 2025. Venue: ICML 2025 poster. License: CC BY-NC 4.0

TL;DR: A framework to align diffusion-based human mesh recovery methods via direct preference optimization.

Abstract: Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that **A**ligns a **D**iffusion-based **HMR** model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: [*https://github.com/shenwenhao01/ADHMR*](https://github.com/shenwenhao01/ADHMR).
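The preference finetuning step can be illustrated with a minimal sketch of a Diffusion-DPO-style loss for a single winner/loser pair. The abstract does not state the exact objective, so the form below (comparing squared noise-prediction errors of the finetuned model against a frozen reference copy) is an assumption based on the commonly used diffusion variant of direct preference optimization. The function names, the `beta` value, and the scalar error inputs are all hypothetical and are not taken from the ADHMR code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_loss(err_policy_w: float, err_ref_w: float,
                       err_policy_l: float, err_ref_l: float,
                       beta: float = 1.0) -> float:
    """Diffusion-DPO-style loss for one (winner, loser) pair (illustrative only).

    Each err_* is a squared noise-prediction error ||eps - eps_model(x_t, t)||^2
    at a shared timestep. "policy" is the model being finetuned, "ref" is the
    frozen base model, "w" is the winner mesh and "l" is the loser mesh.
    """
    # Negative delta means the policy denoises that sample better than the reference.
    delta_w = err_policy_w - err_ref_w
    delta_l = err_policy_l - err_ref_l
    # The loss rewards improving on the winner more than on the loser.
    logit = -beta * (delta_w - delta_l)
    # -log(sigmoid(logit)) written as softplus(-logit) for stability.
    return softplus(-logit)
```

When the finetuned model equals the reference, the loss is log 2. It falls below that value as the model fits the winner better than the loser relative to the reference, and rises above it in the opposite case.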

Lay Summary: Turning a single photo into an accurate 3D model of a person is tricky: the picture hides parts of the body and gives no depth information, so computers must guess. Modern systems tackle this by generating many plausible body shapes through an iterative noise‑removal process called diffusion, but they still often misplace joints or fail on everyday “in‑the‑wild” photos. We introduce ADHMR, a technique that teaches the computer to prefer the guesses that actually line up with the picture. First, we build an automatic scorer that, like a referee, rates each 3D guess according to how well it matches visible body landmarks. Using these scores, we form pairs of “better” and “worse” examples and retrain the diffusion model—borrowing a learning rule from language models—so it consistently chooses the better one. ADHMR cuts pose‑position errors and works with far fewer guesses. The same scorer can also weed out bad training data, boosting other human‑modeling tools. More reliable 3D people models will enhance virtual try‑on, animation, and augmented‑reality applications.
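The two uses of the scorer described above, forming "better"/"worse" pairs and weeding out bad training data, can be sketched as follows. This is a simplified illustration under stated assumptions: the `score` callable stands in for HMR-Scorer, and the selection rule (highest-scored candidate as winner, lowest as loser), the `min_gap` parameter, and the `threshold` parameter are hypothetical choices, not the procedure documented in the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Mesh = TypeVar("Mesh")  # placeholder type for one 3D mesh prediction


def build_preference_pair(
    candidates: Sequence[Mesh],
    score: Callable[[Mesh], float],  # hypothetical stand-in for HMR-Scorer
    min_gap: float = 0.0,
) -> Optional[Tuple[Mesh, Mesh]]:
    """Return (winner, loser) for one image, or None if no clear preference exists."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=score)  # ascending: worst first, best last
    loser, winner = ranked[0], ranked[-1]
    # Skip pairs whose scores are too close to carry a useful training signal.
    if score(winner) - score(loser) <= min_gap:
        return None
    return winner, loser


def filter_training_samples(
    samples: Sequence[Mesh],
    score: Callable[[Mesh], float],
    threshold: float,
) -> List[Mesh]:
    """Keep only samples whose score reaches the threshold (data cleaning)."""
    return [s for s in samples if score(s) >= threshold]
```

For example, with toy candidates represented by their error values and `score = lambda m: -abs(m)`, the candidates `[0.3, 0.1, 0.9]` yield the pair `(0.1, 0.9)`.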

Link To Code: https://github.com/shenwenhao01/ADHMR

Primary Area: Applications->Computer Vision

Keywords: human mesh recovery, direct preference optimization

Submission Number: 6115
