SAGE: Fast, Generalizable and Photorealistic 3D Human Reconstruction from a Single Image

Hezhen Hu; Wangbo Zhao; Lanqing Guo; Hanwen Jiang; Jonathan C. Liu; Suya You; Kai Wang; Zhangyang Wang; Georgios Pavlakos

20 Sept 2025 (modified: 14 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: 3D Human Reconstruction; Single Image; Large Human Reconstruction Model

TL;DR: We propose a Large Human Reconstruction Model, which can produce a photorealistic 3D reconstruction of a human from a single image in less than 1 second.

Abstract: We present SAGE, a Large Human Reconstruction Model that produces a photorealistic 3D reconstruction of a human from a single image in less than 1 second. To support scalable training, we first design a data generation pipeline that alleviates the shortage of photorealistic 3D human data. The pipeline follows two strategies. The first animates existing rigged assets with a wide range of everyday poses. The second takes existing multi-camera captures of humans and applies fitting to generate more diverse training views. Together, these strategies let us scale to 100k assets, increasing both the quantity and the diversity of data available for robust model training. Architecturally, our framework is inspired by Large Reconstruction Models (LRMs): it extracts tokenized features from the input image and from an estimated SMPL body mesh, a simplified human mesh that carries no detailed geometry or appearance. A mapping network takes these tokens as conditioning and uses cross-attention to iteratively refine an initial feature representation. The output is a triplane representation of the 3D human, from which novel views are rendered with a standard ray marching method given a camera viewpoint. Extensive experiments on three benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions.
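The rendering step named in the abstract (a triplane queried along camera rays) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the nearest-neighbour lookup, the summation of the three plane features, the 4-channel layout, and the `decode` function are all assumptions made for illustration, and the names `sample_plane`, `query_triplane`, `decode`, and `march_ray` are hypothetical. Only the general idea (project a 3D point onto three axis-aligned feature planes, then composite density and colour along a ray) comes from the text.

```python
import math

def sample_plane(plane, u, v):
    """Nearest-neighbour lookup in a square feature plane; u, v nominally in [-1, 1]."""
    res = len(plane)
    i = min(res - 1, max(0, int((v + 1.0) * 0.5 * res)))  # row index, clamped
    j = min(res - 1, max(0, int((u + 1.0) * 0.5 * res)))  # column index, clamped
    return plane[i][j]

def query_triplane(planes, point):
    """Project a 3D point onto the XY, XZ and YZ planes and sum the features (assumed aggregation)."""
    x, y, z = point
    feats = [
        sample_plane(planes["xy"], x, y),
        sample_plane(planes["xz"], x, z),
        sample_plane(planes["yz"], y, z),
    ]
    return [sum(channel) for channel in zip(*feats)]

def decode(feature):
    """Hypothetical decoder: channel 0 -> non-negative density, channels 1..3 -> RGB in (0, 1)."""
    density = max(0.0, feature[0])
    rgb = [1.0 / (1.0 + math.exp(-c)) for c in feature[1:4]]
    return density, rgb

def march_ray(planes, origin, direction, near, far, n_samples):
    """Composite colour along one ray using emission-absorption quadrature at midpoints."""
    step = (far - near) / n_samples
    transmittance = 1.0
    colour = [0.0, 0.0, 0.0]
    for k in range(n_samples):
        t = near + (k + 0.5) * step
        point = [o + t * d for o, d in zip(origin, direction)]
        density, rgb = decode(query_triplane(planes, point))
        alpha = 1.0 - math.exp(-density * step)   # opacity of this segment
        weight = transmittance * alpha            # contribution of this sample
        colour = [c + weight * s for c, s in zip(colour, rgb)]
        transmittance *= 1.0 - alpha
    return colour, 1.0 - transmittance            # RGB and accumulated opacity
```

Here `planes` is a dict with keys `"xy"`, `"xz"`, and `"yz"`, each a square grid of 4-channel feature lists. In a learned model the decoder would typically be a small network and the lookup bilinear; both are simplified here to keep the sketch self-contained.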

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 24972
