Avatar++: Fast and Pose-Controllable 3D Human Avatar Generation from a Single Image

Published: 14 Sept 2025, Last Modified: 13 Oct 2025, ICCV 2025 Wild3D, CC BY 4.0
Keywords: Avatar, Single-image-to-multi-view, Diffusion, Digital twins
TL;DR: Avatar++ is a fast, optimization-free pipeline that generates 3D human avatars from a single frontal image in under 15 seconds.
Abstract: We introduce Avatar++, an optimization-free pipeline that converts a single frontal photograph into a 3D representation in a single forward pass, taking less than 15 seconds. Generating a human avatar from a single image is challenging because of the complex structure of the human body and the fine detail of facial features. Most existing models rely on Score Distillation Sampling or iterative refinement to progressively enhance the generated textures, which requires computationally expensive and time-consuming optimization steps. Avatar++ instead generates a human avatar through a fast and efficient single forward pass. The model uses two types of embeddings: a facial identity embedding and a visual embedding. By combining the two, our multi-view Diffusion Transformer (DiT) generates viewpoint-aligned images that preserve the subject’s facial identity. We also introduce an attention mechanism that propagates information from the input image during sampling to enhance visual quality. In addition, pose guidance allows the model either to generate a canonical pose (e.g., T-pose or A-pose) or to replicate the pose from the input image using OpenPose. Beyond offering control over the pose in the generated multi-view images, this mechanism enables the creation of animatable human avatars by generating canonical poses compatible with Gaussian Articulated Template Models. Canonical poses are especially advantageous for animation because they typically provide less occluded views of the body, thereby improving reconstruction quality. Together, these contributions position Avatar++ as a unified and efficient framework for generating identity-consistent and pose-controllable 3D human avatars from a single image.
The proposed model achieves state-of-the-art performance on the THuman2.0 and RenderPeople benchmarks across all evaluation metrics, while delivering 5× faster inference than the fastest existing method.
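The abstract does not say how the facial identity and visual embeddings are combined or how the DiT consumes them. As a purely illustrative sketch, and not the authors' implementation, the snippet below assumes the two embeddings are concatenated along the token axis and read by the view tokens through single-head scaled dot-product cross-attention with no learned projections. All names (`build_condition`, `cross_attention`, the toy token values) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def build_condition(identity_tokens, visual_tokens):
    # Assumption: the two embeddings are combined by concatenating their
    # tokens; the abstract does not state the actual fusion rule.
    return identity_tokens + visual_tokens

def cross_attention(queries, context):
    # Single-head scaled dot-product attention without learned projections:
    # each query token becomes a weighted average of the context tokens.
    d = len(context[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context))
                    for j in range(d)])
    return out

# Toy 4-dimensional tokens (hypothetical values, for illustration only).
identity_tokens = [[1.0, 0.0, 0.0, 0.0]]                       # face identity
visual_tokens = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]   # visual features
view_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]     # one view's latents

condition = build_condition(identity_tokens, visual_tokens)
attended = cross_attention(view_tokens, condition)
```

In this toy setup every output token is a convex combination of the conditioning tokens, so a view token that matches none of them attends to all three equally. A real model would add learned query, key, and value projections and multiple heads.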
Submission Number: 1