One-Step Multi-Frame Inpainting Framework for Real-Time Lip-Sync Digital Human Generation

Published in Int. J. Pattern Recognit. Artif. Intell., 2025. Last modified: 25 Jan 2026. License: CC BY-SA 4.0.
Abstract: Audio-driven lip-sync generation for digital humans has recently attracted considerable attention. However, prevailing methods frequently suffer from high computational complexity and poor real-time performance. Although the MuseTalk framework has achieved notable gains in inference efficiency through its end-to-end, latent-space-based single-step generation algorithm, it still exhibits noticeable lip jitter and insufficient synchronization between audio and lip movements. To address these limitations, we propose an enhanced multi-frame inpainting framework that integrates a Variational Autoencoder (VAE) with a multi-scale U-Net architecture. Specifically, our approach directly synthesizes the occluded lip region from multi-frame visual references combined with the corresponding audio embeddings, thereby improving lip synchronization while maintaining identity consistency. Furthermore, we introduce a landmark-guided multi-frame sampling strategy that focuses model attention on lip dynamics. To facilitate deeper feature extraction and fusion, we propose a hierarchical latent-space feature fusion network (FusionNet), incorporating global and local residual connections and an enhanced Convolutional Block Attention Module. Additionally, a frame interpolation technique is applied during inference to further smooth lip movements and significantly reduce lip jitter. The model is trained on a large-scale Chinese dataset and evaluated on both Chinese and English datasets. Experimental results demonstrate that the proposed framework achieves high visual accuracy, consistent lip synchronization, and efficient real-time inference, highlighting its strong cross-lingual generalization capability.
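As a rough illustration of the inference-time frame interpolation mentioned in the abstract, the sketch below inserts linearly blended frames between consecutive generated frames to smooth lip motion. This is a minimal assumption-laden example: the paper does not specify its interpolation scheme, and the function name and simple cross-fade are hypothetical.

```python
import numpy as np

def interpolate_frames(frames, n_mid=1):
    """Insert n_mid linearly blended frames between each consecutive pair.

    frames: list of H x W x C float arrays (generated lip-region frames).
    Hypothetical helper; the paper's actual interpolation method may differ
    (e.g. optical-flow-based or learned interpolation).
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        for k in range(1, n_mid + 1):
            t = k / (n_mid + 1)
            # Linear cross-fade between the two neighboring frames
            out.append((1.0 - t) * a + t * b)
    out.append(frames[-1])
    return out
```

With `n_mid=1`, a sequence of N generated frames becomes 2N-1 frames, halving the apparent frame-to-frame change in the lip region, which is one simple way to mitigate jitter at a modest cost in output frame rate.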