Keywords: Neural Rendering, Neural (implicit) representations, 3D human body shape modeling, Generative models
Abstract: Over the past years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans.
However, a core remaining challenge is the fine-grained and user-friendly editing of clothing styles by means of textual descriptions.
In particular, text-based edits of full-body avatars should satisfy two properties: 1) spatio-temporal consistency, i.e. the dynamics and the photo-real quality of the original avatar should remain intact; 2) the final result should respect the user-specified edit. To this end, we present TEDRA, the first method for text-based edits of an avatar that are photorealistic, space-time coherent, and dynamic, while enabling skeletal pose and view control. We leverage a pre-trained avatar represented as a signed distance and radiance field, anchored to an explicit and deformable mesh template. After this pre-training stage, we obtain a drivable and photo-real digital counterpart of the real actor. Specifically, we employ an optimization strategy that integrates frames capturing distinct camera perspectives and the dynamics of a video performance into a single personalized diffusion model. Utilizing this personalized diffusion model, we modify the dynamic avatar according to a provided text prompt, introducing the Normal Aligned Identity Preserving Score Distillation Sampling (NAIP-SDS) within a model-based guidance framework. Additionally, we apply a time-step annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in terms of functionality and visual quality.
Thus, our method is a clear step towards intuitive and photorealistic editing of digital avatars that explicitly accounts for dynamics and allows skeletal pose and view control at test time.
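For context, NAIP-SDS builds on Score Distillation Sampling; the gradient below is the standard SDS formulation from the score-distillation literature, shown here only as background and not as the paper's NAIP-SDS objective:

$$\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t,\epsilon}\Big[\, w(t)\,\big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,\tfrac{\partial x}{\partial \theta}\Big], \qquad x_t = \alpha_t\, x(\theta) + \sigma_t\, \epsilon,$$

where $x(\theta)$ is the rendered image of the avatar, $y$ the text prompt, $\hat{\epsilon}_{\phi}$ the noise prediction of the (personalized) diffusion model, and $w(t)$ a time-step weighting. A time-step annealing strategy, in general, restricts the sampled $t$ to progressively lower noise levels as the edit optimization proceeds; the paper's exact schedule is not specified here.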
Supplementary Material: zip
Submission Number: 106