VGA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos

Xinqi Liu, Chenming Wu

Published: 2025, Last Modified: 15 Apr 2026 | CVM (2) 2025 | CC BY-SA 4.0

Abstract: We present VGA, a framework for reconstructing vivid, high-fidelity 3D Gaussian avatars with full-body and fine-grained finger control from monocular video. Our contributions are twofold: more precise pose alignment and a refined 3D Gaussian representation. First, we introduce a pose refinement method that uses normal maps and silhouette alignment to improve the accuracy of hand and foot poses, which in turn supports accurate shape and appearance modeling. Second, we address the unbalanced aggregation and initialization bias of the 3D Gaussian representation with a surface-guided re-initialization strategy. This strategy distributes the 3D Gaussians more uniformly and aligns them with the avatar’s underlying surface, improving rendering quality and stability under novel poses. Extensive experiments show that our method achieves state-of-the-art performance in photo-realistic novel view synthesis while offering fine-grained control over body and finger movements. Both qualitative and quantitative analyses support the robustness and expressiveness of the method, a substantial advance in 3D avatar reconstruction from monocular video.