Abstract: Audio-to-talking face generation stands at the forefront of advances in generative AI, bridging the gap between audio and visual representations by generating synchronized, realistic talking faces. Despite recent progress, the lack of realism in animated faces, asynchronous audio-lip movements, and high computational cost remain key barriers to practical application. To address these challenges, we introduce StableTalk, a novel approach that leverages the emerging capabilities of Stable Diffusion models and Vision Transformers for talking face generation. We further integrate a Re-attention mechanism and an adversarial loss to improve the consistency of facial animations and their synchronization with the input audio. Moreover, the computational efficiency of our method is notably enhanced by operating in the latent space and dynamically adjusting the focus on different parts of the visual content according to the provided conditions. Our experimental results demonstrate the superiority of StableTalk over existing approaches in image quality, audio-lip synchronization, and computational efficiency.
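The abstract does not detail the architecture, but the following minimal sketch illustrates one plausible reading of audio-conditioned cross-attention over latent visual tokens combined with a Re-attention step that mixes attention maps across heads. The module name `AudioReAttention`, the tensor shapes, and the normalization choice are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the StableTalk code): cross-attention from
# latent visual tokens to audio features, followed by a learnable
# head-mixing (Re-attention) step. Shapes and hyperparameters are assumed.
import torch
import torch.nn as nn


class AudioReAttention(nn.Module):
    """Audio-conditioned cross-attention with Re-attention across heads."""

    def __init__(self, dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.to_q = nn.Linear(dim, dim, bias=False)             # queries from visual latents
        self.to_kv = nn.Linear(audio_dim, dim * 2, bias=False)  # keys/values from audio
        self.head_mix = nn.Parameter(torch.eye(num_heads))      # learnable head-mixing matrix
        self.norm = nn.BatchNorm2d(num_heads)                   # renormalize mixed attention maps
        self.to_out = nn.Linear(dim, dim)

    def forward(self, latents: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, dim) flattened latent-space visual tokens
        # audio:   (B, T, audio_dim) audio feature sequence
        B, N, _ = latents.shape
        q = self.to_q(latents).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.to_kv(audio).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale           # (B, H, N, T)
        attn = attn.softmax(dim=-1)
        # Re-attention: mix attention maps across heads, then renormalize.
        attn = torch.einsum('hg,bgnt->bhnt', self.head_mix, attn)
        attn = self.norm(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)


if __name__ == "__main__":
    block = AudioReAttention(dim=320, audio_dim=768, num_heads=8)
    latents = torch.randn(2, 256, 320)      # e.g. a flattened 16x16 latent grid
    audio = torch.randn(2, 50, 768)         # e.g. 50 frames of audio features
    print(block(latents, audio).shape)      # torch.Size([2, 256, 320])
```

In this sketch, operating on compact latent tokens rather than full-resolution frames is what keeps the attention cost low, while the head-mixing matrix lets the model redistribute attention across regions of the face conditioned on the audio.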