Outstanding Orthodontist: No More Artifactual Teeth in Talking Face

Published: 01 Jan 2025, Last Modified: 07 Nov 2025IJCAI 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Audio-driven talking face synthesis (TFS) enables the creation of realistic speaking videos by combining a single facial image with a speech audio clip. Unlike other facial features that naturally deform during speech, teeth represent unique rigid structures whose shape and size should remain constant throughout the video sequence. However, current methods often produce temporal inconsistencies and artifacts in the teeth region, resulting in a less realistic appearance of the generated videos. To address this, we propose OrthoNet, a plug-and-play framework designed to eliminate unrealistic teeth effects in audio-driven TFS. Our method introduces a Detail-oriented Teeth Aligner module, designed to preserve teeth details and adapt to their shape. It works with a Memory-guided Teeth Stabilizer that integrates a long-term memory bank for global teeth structure and a short-term memory module for local temporal dynamics. Through this framework, OrthoNet acts like an orthodontist for existing Audio2Video methods, ensuring that teeth maintain natural rigidity and temporal consistency even under varying degrees of teeth occlusion. Extensive experiments demonstrate that our method makes the teeth in generated videos appear more natural during speech, significantly enhancing the temporal consistency and structural stability of audio-driven video generation.
Loading