Abstract: We introduce **StyleTalker**, a text-guided framework for editing and animating dynamic 3D head avatars from a monocular video. Current 3D scene editing techniques face two main challenges when applied to this task: 1) They typically require multi-view videos for accurate geometry reconstruction and are not designed for dynamic scenes, making them ineffective for editing talking head avatars from a single-view video. 2) They struggle with fine-grained local edits, largely due to biases inherited from pre-trained 2D image diffusion models and limitations in detecting detailed facial landmarks. To overcome these challenges, StyleTalker introduces two key innovations: **1)** a **mesh-enhanced 3D Gaussian reconstruction** approach that combines 3D head priors with multi-view video diffusion, improving the accuracy and flexibility of reconstruction; and **2)** a **landmark-driven talking head editing** method that uses 3D facial landmarks to guide the editing process. By modulating edit strength according to the distance to these landmarks, our method preserves the avatar's original identity while achieving the desired edits. Extensive experiments demonstrate that StyleTalker outperforms current state-of-the-art methods, delivering high-quality edits and enabling animation of avatars with diverse facial expressions, all from a single source video.
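To make the landmark-driven editing idea concrete, the sketch below illustrates one plausible form of distance-based edit weighting: edits are applied at full strength near selected 3D facial landmarks and fall off with distance, leaving the rest of the avatar untouched. This is a minimal illustration, not the authors' implementation; the Gaussian falloff, the `sigma` bandwidth, and the function names are assumptions.

```python
# Minimal sketch (assumed, not the paper's code): distance-weighted blending of an edit,
# where edit strength decays with distance to the nearest selected 3D facial landmark.
import numpy as np


def landmark_edit_weights(points: np.ndarray, landmarks: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Return a per-point weight in [0, 1]: ~1 near a target landmark, ~0 far away.

    points:    (N, 3) 3D positions (e.g., Gaussian centers).
    landmarks: (L, 3) 3D facial landmarks marking the edit region.
    sigma:     falloff bandwidth (illustrative choice).
    """
    # Distance from every point to its nearest landmark.
    d = np.linalg.norm(points[:, None, :] - landmarks[None, :, :], axis=-1).min(axis=1)
    # Gaussian falloff: strong edits near landmarks, original attributes preserved elsewhere.
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))


def blend_edit(original: np.ndarray, edited: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linearly blend edited and original per-point attributes by the edit weight."""
    return weights[:, None] * edited + (1.0 - weights[:, None]) * original
```

Under this kind of scheme, identity preservation follows directly from the weighting: points far from the edit-region landmarks keep their original attributes regardless of what the text-guided edit produces there.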
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Adam_W_Harley1
Submission Number: 6201