Keywords: 3D reconstruction, multi-view diffusion model, physical interaction, digital human
Abstract: Reconstructing textured 3D human models from a single image is fundamental to AR/VR and digital human applications. However, existing methods predominantly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often produces artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce HUG3D, a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Aware Multi-View Diffusion (HUG-MVD), then generates complete multi-view normal maps and images by jointly modeling individuals and their group context, resolving occlusions and ambiguities caused by close proximity. Subsequently, the Human Group-Aware Geometric Reconstruction (HUG-GR) module optimizes the geometry using explicit, physics-based interaction priors that enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture.
Extensive experiments show that HUG3D significantly outperforms existing single-human and multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting groups from a single image.
Primary Area: generative models
Submission Number: 19527