Abstract: Neural radiance fields are capable of reconstructing high-quality drivable human avatars, but they are expensive to train and render and are not suitable for multi-human scenes with complex shadows. To reduce this computational cost, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming the 3D Gaussians to posed space according to the input poses. We introduce a multi-head hash encoder for pose-dependent shape and appearance and a time-dependent ambient occlusion module to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method achieves higher reconstruction quality than InstantAvatar with 1/60 of the training time, 1/4 of the GPU memory, and $7\times$ faster rendering. Our method can be easily extended to multi-human scenes, achieving comparable novel view synthesis results on a scene with ten people after only 25 seconds of training. We will release the code and dataset.
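Below is a minimal sketch of the canonical-to-posed deformation step summarized in the abstract, assuming the Gaussian means are skinned with standard linear blend skinning (LBS). The function and variable names (`lbs_deform`, `weights`, `bone_transforms`) are illustrative assumptions, not the authors' actual API, and the sketch omits the deformation of Gaussian covariances and the pose-dependent appearance modules.

```python
# Sketch (not the authors' code): deforming canonical 3D Gaussian centers
# into posed space with standard linear blend skinning (LBS).
import numpy as np

def lbs_deform(centers, weights, bone_transforms):
    """Deform canonical Gaussian centers to posed space via LBS.

    centers:         (N, 3) canonical Gaussian means
    weights:         (N, B) per-Gaussian skinning weights (rows sum to 1)
    bone_transforms: (B, 4, 4) rigid bone transforms for the input pose
    """
    # Blend the per-bone transforms for each Gaussian: (N, 4, 4)
    blended = np.einsum('nb,bij->nij', weights, bone_transforms)
    # Apply each blended transform to its homogeneous center
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)  # (N, 4)
    posed = np.einsum('nij,nj->ni', blended, homo)
    return posed[:, :3]

# Toy usage: two Gaussians skinned to two bones
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
T = np.stack([np.eye(4), np.eye(4)])
T[1, :3, 3] = [0.0, 1.0, 0.0]  # second bone translates along y
print(lbs_deform(centers, weights, T))
```

Since the blended transform is an explicit per-point matrix, this deformation is cheap and differentiable, which is consistent with the fast training and rendering the abstract reports.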
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work focuses on 3D human model reconstruction from discrete images and free-viewpoint video synthesis from multi-view video, which involves multi-modal interaction among images, videos, and 3D models. It contributes to multimedia processing by providing a method for real-time rendering and fast reconstruction of high-quality digital humans, benefiting applications such as virtual reality, gaming, sports broadcasting, and telepresence.
Supplementary Material: zip
Submission Number: 1456