Talk2Me: High-Fidelity and Controllable Audio-Driven Avatars with Gaussian Splatting

16 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Talking head synthesis, 3D Gaussian Splatting
Abstract: Audio-driven avatars are increasingly employed in online meetings, virtual humans, gaming, and film production. However, existing approaches suffer from technical limitations, including low visual fidelity (e.g., facial collapse and loss of detail) and limited controllability over expression and motion, such as inaccurate lip synchronization and unnatural head movement. Moreover, most existing methods lack explicit modeling of the correlation between facial expressions and head-pose dynamics, which compromises realism. To address these challenges, we propose Talk2Me, a high-fidelity, expressive, and controllable audio-driven framework comprising three core modules. First, we enhance 3D Gaussian Splatting (3DGS) with a Learnable Positional Encoding (LPE) and a modified Region-Weighted Mechanism to mitigate facial collapse and preserve fine details. Second, an Expression Generator (EG) with an Audio-Expression Temporal Fusion (AETF) module models the temporal relationship between audio and expression features, enabling accurate lip-sync and natural expression transitions. Third, a Retrieval-Based Pose Generator (RBPG) explicitly captures the coupling between expressions and pose dynamics, while a Pose Refiner (PR) enhances the naturalness and continuity of head motion. We further construct a Mandarin monocular video dataset featuring diverse identities to evaluate cross-lingual generalization. Experiments demonstrate that Talk2Me outperforms state-of-the-art methods in visual quality, synchronization accuracy, and motion naturalness.
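Since the submission does not include released code, the snippet below is only a minimal sketch of what the Learnable Positional Encoding (LPE) applied to 3D Gaussian centers might look like; the module name, dimensions, and learnable frequency-bank design are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a Learnable Positional Encoding (LPE) for 3D Gaussian centers.
# All names and dimensions are assumptions; the paper does not specify its design.
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Map 3D Gaussian centers to a feature vector via learned frequencies."""
    def __init__(self, in_dim: int = 3, num_freqs: int = 8, out_dim: int = 64):
        super().__init__()
        # Learnable frequency bank in place of the fixed sin/cos frequencies of standard PE.
        self.freqs = nn.Parameter(torch.randn(num_freqs, in_dim))
        self.proj = nn.Linear(in_dim + 2 * num_freqs, out_dim)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) Gaussian centers; phases: (N, num_freqs)
        phases = xyz @ self.freqs.t()
        feat = torch.cat([xyz, torch.sin(phases), torch.cos(phases)], dim=-1)
        return self.proj(feat)

if __name__ == "__main__":
    lpe = LearnablePositionalEncoding()
    centers = torch.rand(1024, 3)   # e.g., 1024 Gaussian centers
    print(lpe(centers).shape)       # torch.Size([1024, 64])
```

One plausible reading is that such a per-Gaussian encoding feeds the downstream deformation or appearance heads, giving the model spatial features it can adapt during training rather than relying on fixed Fourier frequencies.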
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7522