Abstract: Recently, talking face generation has drawn considerable attention from researchers due to its wide range of applications. The lip synchronization accuracy and visual quality of the generated target speaker are crucial for synthesizing photo-realistic talking face videos. Prior methods often produce unnatural and incongruous results, or attain comparatively high fidelity but only for a specific target speaker. In this paper, we propose a novel adversarial learning framework for talking face generation of arbitrary target speakers. To provide sufficient visual information about the lip region during video synthesis, we introduce a spatial attention mechanism that enables our model to focus on constructing the lip region. In addition, we employ a content loss and a total variation regularization in our objective function to reduce lip shaking and artifacts in the deformed regions. Extensive experiments demonstrate that our method outperforms other representative approaches.
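For reference, the total variation term mentioned above is commonly defined as follows; this is an illustrative textbook formulation over a generated frame $\hat{I}$, since the abstract does not state the paper's exact loss:

$$\mathcal{L}_{\mathrm{TV}}(\hat{I}) = \sum_{i,j} \left( \big|\hat{I}_{i+1,j} - \hat{I}_{i,j}\big| + \big|\hat{I}_{i,j+1} - \hat{I}_{i,j}\big| \right),$$

where $\hat{I}_{i,j}$ denotes the pixel intensity at location $(i,j)$. Under this reading, the full objective would combine the adversarial, content, and smoothness terms, e.g. $\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda_{c}\,\mathcal{L}_{\mathrm{content}} + \lambda_{\mathrm{tv}}\,\mathcal{L}_{\mathrm{TV}}$, where the weights $\lambda_{c}$ and $\lambda_{\mathrm{tv}}$ are assumed hyperparameters, not values given in the source.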