Text-Driven Synchronized Diffusion Video and Audio Talking Head Generation

Zhenfei Zhang, Tsung-Wei Huang, Guan-Ming Su, Ming-Ching Chang, Xin Li

Published: 2024, Last Modified: 12 Nov 2025, MIPR 2024, CC BY-SA 4.0

Abstract: We propose a lightweight approach to ultra-low-bitrate video conferencing and other scenarios that require synchronized audio and video generation from text input. We leverage the Denoising Diffusion Probabilistic Model (DDPM) and a dual-branch U-Net architecture to generate a high-quality talking head with lip synchronization from textual input. Because our method transmits only low-bitrate text during communication, data transfer is efficient. Our system also offers users several options for generating video content: they can input video captions or speech text, or generate content without any text input. Additionally, users can customize the style of the output video by supplying static images. By changing how video and audio are generated and transmitted, this work improves the efficiency and versatility of multimedia applications.