Abstract: Recently, significant advancements have been made in audio-driven talking face generation. While GAN-based methods are widely used for this task, they struggle to achieve lip accuracy and high fidelity simultaneously. Generated lip shapes tend to be overly influenced by the lips of the reference images that provide identity information, leading to unstable and unsynchronized results. Moreover, the synthesized face frequently suffers from blurred teeth, blurred skin textures, and compromised facial identity. To address these challenges, we propose an effective and innovative training strategy that simultaneously ensures lip synchrony and facial fidelity. First, we adaptively select the reference image using a hard-mining-based strategy to prevent the network from simply copying the reference lip, enhancing the stability and synchronicity of lip movements. Second, we incorporate high-resolution facial images when training a quality discriminator within the GAN loss, improving the fidelity of generated faces. Third, a global-to-detail training strategy is employed, first strengthening lip synchrony and then refining image quality to preserve identity and visual details. Experiments on the HDTF dataset demonstrate that our method achieves state-of-the-art performance in both lip accuracy and image quality.
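The abstract does not specify how the hard-mining reference selection is implemented; as a minimal illustrative sketch (all names and the landmark-distance criterion are assumptions, not the paper's method), one could pick the candidate reference frame whose lip landmarks differ most from the target frame, so the generator cannot succeed by copying the reference lip shape:

```python
import numpy as np

def select_hard_reference(target_lip: np.ndarray, candidate_lips: list) -> int:
    """Hypothetical hard-mining selector: return the index of the candidate
    whose lip landmarks are farthest (L2) from the target frame's landmarks,
    forcing the generator to rely on audio rather than copying the reference."""
    dists = [np.linalg.norm(target_lip - cand) for cand in candidate_lips]
    return int(np.argmax(dists))

# Toy usage: three candidate lip-landmark sets, the third is most dissimilar.
target = np.array([0.0, 0.0])
candidates = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([3.0, 4.0])]
hardest = select_hard_reference(target, candidates)  # index 2
```

This is only one plausible reading of "adaptively select the reference image using a hard-mining based strategy"; the paper's actual criterion may differ.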
External IDs: dblp:conf/icmcs/WangZSXCL25