Keywords: Talking Head, Image Animation, Image Synthesis
TL;DR: We propose VividTalk, a generic talking head generation framework. Our method generates high-visual-quality talking head videos with natural facial motion, diverse head poses, and lip-sync improved by a large margin.
Abstract: Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made to improve lip-sync, facial motion, head pose generation, and video quality. However, no existing model leads or ties on all of these metrics, owing to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that generates high-visual-quality talking head videos with all of the above properties. Specifically, in the first stage, we map audio to mesh by learning two types of motion: non-rigid facial motion and rigid head motion. For facial motion, both blendshapes and vertices are adopted as intermediate representations to maximize the representational ability of the model. For head motion, a novel learnable head pose codebook with a two-phase training mechanism is proposed. In the second stage, we propose a dual-branch motion-VAE and a generator that transform the meshes into dense motion and synthesize high-quality video frame by frame. Extensive experiments show that VividTalk generates high-visual-quality talking head videos with lip-sync and realism enhanced by a large margin, and outperforms previous state-of-the-art works in both objective and subjective comparisons. The code will be publicly released upon publication.
Supplementary Material: zip
Submission Number: 54
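Since the code has not yet been released, the following is only a minimal, hypothetical PyTorch-style sketch of how the two-stage pipeline described in the abstract could be wired together: audio to mesh (blendshapes, vertex offsets, and a learnable head pose codebook), then mesh to dense motion via a dual-branch motion-VAE. All module names, dimensions, and architectural details below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage VividTalk pipeline; shapes and modules are assumed.
import torch
import torch.nn as nn

class AudioToMeshStage(nn.Module):
    """Stage 1: map audio features to non-rigid facial motion and rigid head pose."""
    def __init__(self, audio_dim=768, n_blendshapes=52, n_vertices=5023, codebook_size=256):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, 256, batch_first=True)
        # Non-rigid facial motion: blendshape coefficients plus per-vertex offsets.
        self.blendshape_head = nn.Linear(256, n_blendshapes)
        self.vertex_head = nn.Linear(256, n_vertices * 3)
        # Rigid head motion: a discrete learnable pose codebook queried per frame.
        self.pose_codebook = nn.Embedding(codebook_size, 6)  # rotation + translation
        self.pose_logits = nn.Linear(256, codebook_size)

    def forward(self, audio_feats):                 # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)            # (B, T, 256)
        blendshapes = self.blendshape_head(h)       # (B, T, n_blendshapes)
        vertex_offsets = self.vertex_head(h)        # (B, T, 3 * n_vertices)
        pose_idx = self.pose_logits(h).argmax(-1)   # (B, T) codebook indices
        head_pose = self.pose_codebook(pose_idx)    # (B, T, 6)
        return blendshapes, vertex_offsets, head_pose

class MeshToVideoStage(nn.Module):
    """Stage 2: a dual-branch motion-VAE maps mesh features to a dense 2D motion
    field, which a generator would then use to synthesize each frame."""
    def __init__(self, mesh_feat_dim=128, latent_dim=64):
        super().__init__()
        self.global_branch = nn.Linear(mesh_feat_dim, latent_dim)
        self.local_branch = nn.Linear(mesh_feat_dim, latent_dim)
        self.to_dense_motion = nn.Linear(2 * latent_dim, 2 * 64 * 64)  # coarse flow field

    def forward(self, mesh_feats):                  # (B, T, mesh_feat_dim)
        z = torch.cat([self.global_branch(mesh_feats),
                       self.local_branch(mesh_feats)], dim=-1)
        return self.to_dense_motion(z).view(*mesh_feats.shape[:2], 2, 64, 64)

# Usage with dummy tensors (placeholder dimensions).
audio = torch.randn(1, 100, 768)                    # ~100 frames of audio features
stage1 = AudioToMeshStage()
bs, verts, pose = stage1(audio)
stage2 = MeshToVideoStage()
flow = stage2(torch.randn(1, 100, 128))             # placeholder mesh features
print(bs.shape, verts.shape, pose.shape, flow.shape)
```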