Abstract: Creating high-quality, photorealistic 3D digital humans from a single image remains challenging. While existing methods can generate visually appealing multi-view outputs, they often suffer from inconsistencies across viewpoints and camera poses, resulting in suboptimal 3D reconstructions with reduced realism. Furthermore, most approaches focus on body generation while overlooking facial consistency, a perceptually critical issue caused by the fact that the face occupies only a small area of a full-body image (e.g., ∼80 × 80 pixels out of a 512 × 512 image). This limited resolution, and the correspondingly low weight of the facial regions during optimization, leads to insufficient facial detail and inconsistent facial identity features across views. To address these challenges, we leverage the powerful capabilities of 2D video diffusion models for consistent multi-view RGB and normal human image generation, combined with the 3D SMPL-X representation to enable spatial consistency and geometric detail. By fine-tuning DiT models (HumanWan-DiTs) on realistic 3D human datasets using the LoRA technique, our method ensures both generalizability and 3D visual consistency in realistic multi-view human image generation. The proposed facial enhancement is integrated into 3D Gaussian optimization to improve facial details. To further refine results, we apply super-resolution and generative priors to reduce facial blurring, alongside SMPL-X parameter tuning assisted by the generated multi-view normal images, achieving photorealistic and consistent rendering from a single image. Extensive experiments demonstrate that our approach outperforms existing methods, producing photorealistic, consistent, and finely detailed human renderings.
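The abstract mentions fine-tuning the DiT models with the LoRA technique. As a purely illustrative aid (not the paper's implementation, and with all shapes and names chosen arbitrarily here), the core LoRA idea is to freeze a pretrained weight matrix W and learn only a low-rank update B·A, which keeps the trainable parameter count far below that of W itself:

```python
import numpy as np

# Minimal sketch of the LoRA idea: keep the pretrained weight W frozen
# and learn a low-rank update B @ A with rank r << min(d_out, d_in).
# All dimensions and variable names below are illustrative assumptions,
# not values from the paper.
d_out, d_in, r = 64, 64, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x, alpha=1.0):
    """Frozen path plus scaled low-rank update."""
    return W @ x + alpha * (B @ (A @ x))

x = rng.standard_normal(d_in)

# With B initialized to zero, the adapted model reproduces the frozen one,
# so fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)

# Only A and B are trained: r*(d_in + d_out) parameters vs. d_out*d_in.
print(A.size + B.size, "trainable vs.", W.size, "frozen")
```

The zero initialization of B is the standard choice: it guarantees the adapted network is identical to the pretrained one at the start of fine-tuning, so training only gradually specializes the model (here, toward multi-view human image generation).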
External IDs: doi:10.1145/3757377.3763839