Keywords: 3D Computer Vision, Neural Rendering, 3D Avatar Modeling
TL;DR: We present a method to generate a photorealistic 3D conversational avatar from a single image that is driven directly by audio for natural, synchronized speech and gestures.
Abstract: Prior conversational 3D avatar systems first map audio to parametric poses and then pass those poses through a rendering pipeline. This pose-to-render interface forms a lossy bottleneck where quantization, retargeting, and per-frame tracking errors accumulate. As a result, such systems struggle to maintain tight audio–motion synchronization and to express the micro-articulations crucial for conversational realism (bilabial closures, cheek inflation, nasolabial dynamics, eyelid blinks, and fine hand gestures), issues that are amplified under single-image personalization. We address these limitations with an end-to-end framework that constructs a full-body, photorealistic 3D conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is represented as a particle-based deformation field of 3D Gaussian primitives in a canonical space; an audio-conditioned dynamics module produces audio-synchronous per-particle trajectories for the face, hands, and body, enabling localized, high-frequency control while preserving global coherence. A splat-based differentiable renderer maintains identity, texture, and multi-view realism, and we further improve synchronization and natural expressivity by distilling priors from a large audio-driven video diffusion model via feature-level guidance and weak supervision from synthetic, audio-conditioned clips. End-to-end training lets photometric and temporal objectives jointly shape the audio-conditioned deformation and rendering. Across diverse speakers and conditions, our method improves lip–audio synchronization, fine-grained facial detail, and conversational gesture naturalness over pose-driven baselines, while preserving identity from a single photo and supporting photorealistic novel-view synthesis, advancing accessible, high-fidelity digital humans for telepresence, assistants, and mixed reality.
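To make the audio-conditioned, particle-based deformation concrete, here is a minimal PyTorch-style sketch of a module that maps an audio feature sequence to per-Gaussian 3D offsets in canonical space. Everything here (the class name `AudioDeformationField`, the feature dimensions, the GRU/MLP choice) is a hypothetical illustration of the idea stated in the abstract, not the authors' actual implementation.

```python
# Hypothetical sketch: audio-conditioned per-particle deformation of 3D Gaussians.
# Module and parameter names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class AudioDeformationField(nn.Module):
    """Maps an audio feature sequence to per-Gaussian offsets in canonical space."""

    def __init__(self, num_gaussians: int, audio_dim: int = 768, hidden: int = 256):
        super().__init__()
        # Learned embedding identifying each Gaussian primitive ("particle").
        self.particle_embed = nn.Embedding(num_gaussians, hidden)
        # Temporal encoder over per-frame audio features (e.g. wav2vec-style).
        self.audio_encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        # Head predicting a 3D translation per particle per frame
        # (could be extended to rotation/scale/opacity deltas).
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 3)
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, audio_dim) -> per-frame latent (B, T, hidden)
        a, _ = self.audio_encoder(audio_feats)
        B, T, H = a.shape
        p = self.particle_embed.weight                    # (N, hidden)
        # Pair every frame latent with every particle embedding.
        a = a[:, :, None, :].expand(B, T, p.shape[0], H)  # (B, T, N, hidden)
        p = p[None, None].expand(B, T, -1, -1)            # (B, T, N, hidden)
        # Per-particle, per-frame 3D offsets for the canonical Gaussian means.
        return self.head(torch.cat([a, p], dim=-1))       # (B, T, N, 3)
```

In the system described by the abstract, such offsets would be added to the canonical Gaussian means before splat-based differentiable rasterization, so photometric and temporal losses can backpropagate through rendering into the audio conditioning.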
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5612