Control-Talker: A Rapid-Customization Talking Head Generation Method for Multi-Condition Control and High-Texture Enhancement

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: In recent years, the field of talking head generation has made significant strides. However, the need for substantial computational resources for model training, coupled with a scarcity of high-quality video data, makes it difficult to rapidly customize a model to a specific individual. Additionally, existing models usually support only single-modal control and cannot generate vivid facial expressions and controllable head poses from multiple conditions such as audio and video. These limitations restrict the models' widespread application. In this paper, we introduce a two-stage method called Control-Talker that achieves rapid identity customization of talking head models and high-quality generation under multimodal conditions. Specifically, we divide training into a prior learning stage and an identity rapid-customization stage. 1) In the prior learning stage, we leverage a diffusion-based model pre-trained on a high-quality image dataset to acquire a robust, controllable facial prior. We further propose a high-frequency ControlNet structure to enhance the fidelity of the synthesized results: it extracts a high-frequency feature map from the source image that serves as a facial texture prior, thereby faithfully preserving the facial texture of the source image. 2) In the identity rapid-customization stage, the identity is fixed by fine-tuning the U-Net part of the diffusion model on only a few images of a specific individual. The entire fine-tuning process can be completed in approximately ten minutes, significantly reducing training costs. Further, we propose a unified driving method for both audio and video that uses FLAME-3DMM as an intermediary representation, enabling the model to precisely control expressions, poses, and lighting under multiple conditions and significantly broadening the application scope of talking head models. Extensive experiments and visual results demonstrate that our method outperforms other state-of-the-art models while requiring lower training costs and less data.
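As a concrete illustration of the high-frequency texture prior, the following minimal sketch extracts a high-frequency feature map from a source face image using a Laplacian high-pass filter. The abstract does not specify the exact operator used by the high-frequency ControlNet, so the kernel choice, tensor shapes, and function names here are assumptions for illustration only.

```python
# Sketch: extracting a high-frequency feature map from a source face image
# to condition a ControlNet-style branch as a texture prior. The Laplacian
# kernel below is an assumed stand-in for the paper's (unspecified) filter.
import torch
import torch.nn.functional as F

def high_frequency_map(image: torch.Tensor) -> torch.Tensor:
    """image: (B, C, H, W) in [0, 1]; returns a same-shaped high-frequency map."""
    # 3x3 Laplacian high-pass kernel applied per channel (depthwise conv).
    kernel = torch.tensor([[0., -1., 0.],
                           [-1., 4., -1.],
                           [0., -1., 0.]], dtype=image.dtype, device=image.device)
    kernel = kernel.view(1, 1, 3, 3).repeat(image.shape[1], 1, 1, 1)
    return F.conv2d(image, kernel, padding=1, groups=image.shape[1])

# Usage: the resulting map would be passed to the ControlNet branch
# alongside the usual denoising inputs as a facial texture prior.
source = torch.rand(1, 3, 512, 512)
texture_prior = high_frequency_map(source)  # (1, 3, 512, 512)
```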
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: In this paper, we propose a unified driving method for both audio and video that uses FLAME-3DMM as an intermediary representation. The proposed model captures facial priors from high-quality datasets in the prior learning stage and fixes the character's identity in the identity rapid-customization stage, which not only fully leverages the priors of high-quality datasets but also reduces the computational resources and data-quality requirements for adapting the model to specific individuals. Furthermore, our proposed HF-ControlNet enhances the texture quality of talking head synthesis results by extracting a high-frequency feature map from face images. In summary, our model enables low-cost, rapid customization of talking head models and supports multi-conditional control of talking head video synthesis.
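To make the unified driving interface concrete, the sketch below maps either an audio frame or a video frame to a shared set of FLAME-3DMM parameters that would then drive the generator. The encoders are hypothetical stubs rather than the authors' networks, and the parameter dimensions follow common FLAME conventions but are assumptions.

```python
# Sketch: a single FLAME-3DMM parameter set as the intermediary that either
# modality (audio or video) can produce, enabling unified driving.
from dataclasses import dataclass
import numpy as np

@dataclass
class FlameParams:
    expression: np.ndarray  # (100,) expression coefficients (assumed size)
    jaw_pose: np.ndarray    # (3,) jaw rotation
    head_pose: np.ndarray   # (6,) global rotation + translation (assumed size)

def encode_audio(audio_frame: np.ndarray) -> FlameParams:
    """Hypothetical audio-to-FLAME regressor (stubbed with zeros)."""
    return FlameParams(np.zeros(100), np.zeros(3), np.zeros(6))

def encode_video(video_frame: np.ndarray) -> FlameParams:
    """Hypothetical video-to-FLAME estimator (stubbed with zeros)."""
    return FlameParams(np.zeros(100), np.zeros(3), np.zeros(6))

def drive(condition: np.ndarray, modality: str) -> FlameParams:
    # One intermediate representation lets either modality drive the generator.
    return encode_audio(condition) if modality == "audio" else encode_video(condition)
```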
Supplementary Material: zip
Submission Number: 5271