ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Diffusion models have shown impressive potential for talking head generation. While they achieve plausible appearance and talking effects, these methods still suffer from temporal, 3D, or expression inconsistency due to error accumulation and the inherent limitations of single-image generation. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly applying multi-modal conditions to the diffusion process, our method first learns to model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing the high-frequency features and contours that vary significantly along the time axis. Using a temporally consistent diffusion module, we learn to align the TSD of the initial result to that of the ground-truth video frame. The final avatar is generated by a fully consistent diffusion module conditioned on the aligned TSD, a rough head normal map, and an emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking heads. Furthermore, its reliable guidance complements the inaccuracy of the other conditions, suppressing accumulated error while improving consistency across multiple aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms state-of-the-art methods in appearance, 3D, expression, and temporal consistency.
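The abstract describes a two-stage conditioning pipeline: extract a TSD map, align it with a temporally consistent diffusion module, then condition a second diffusion module on the aligned TSD plus a rough head normal and an emotion embedding. Below is a minimal sketch of that data flow, not the authors' implementation; all module names are hypothetical, the TSD extraction is approximated by a simple Sobel high-pass, and the denoisers are toy stand-ins for the paper's diffusion modules.

```python
# Hypothetical sketch of the ConsistentAvatar conditioning pipeline described
# in the abstract. TSD extraction is approximated with a Sobel edge magnitude;
# the real method's TSD map and diffusion modules are more elaborate.
import torch
import torch.nn as nn
import torch.nn.functional as F


def tsd_map(frame: torch.Tensor) -> torch.Tensor:
    """Proxy for the Temporally-Sensitive Detail map: high-frequency detail
    via Sobel edge magnitude. frame: (B, 3, H, W) in [0, 1] -> (B, 1, H, W)."""
    gray = frame.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=frame.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)


class CondDenoiser(nn.Module):
    """Toy conditional denoiser standing in for either diffusion module:
    concatenates the noisy frame with its conditions and predicts noise."""

    def __init__(self, cond_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + cond_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy, conds):
        return self.net(torch.cat([noisy] + conds, dim=1))


# Stage 1: "temporally consistent diffusion module" -- aligns the TSD of the
# initial result toward that of the ground-truth frame (conditioned on TSD).
# Stage 2: "fully consistent diffusion module" -- conditioned on the aligned
# TSD, a rough head normal map, and a spatially broadcast emotion embedding.
temporal_module = CondDenoiser(cond_channels=1)           # TSD only
final_module = CondDenoiser(cond_channels=1 + 3 + 16)     # TSD + normal + emotion

B, H, W = 2, 64, 64
initial_result = torch.rand(B, 3, H, W)                   # coarse talking-head frame
head_normal = torch.rand(B, 3, H, W)                       # rough head normal map
emotion = torch.randn(B, 16, 1, 1).expand(B, 16, H, W)     # emotion prompt embedding

noisy = torch.randn(B, 3, H, W)
residual = temporal_module(noisy, [tsd_map(initial_result)])
aligned_tsd = tsd_map(initial_result + residual)           # stand-in for TSD alignment

pred_noise = final_module(noisy, [aligned_tsd, head_normal, emotion])
print(pred_noise.shape)  # torch.Size([2, 3, 64, 64])
```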
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Multimodal Fusion, [Content] Vision and Language, [Generation] Generative Multimedia
Relevance To Conference: This work contributes to multimedia and multimodal processing by addressing key challenges in generating high-fidelity talking avatars. By introducing the ConsistentAvatar framework, the paper tackles the temporal, 3D, and expression inconsistency inherent in existing diffusion models. The core idea is to learn temporal representations and align them with ground-truth data to ensure stability across adjacent frames. The Temporally-Sensitive Detail (TSD) map captures high-frequency features and contours that vary over time, guiding the diffusion process toward temporally stable avatars. The method further exploits multimodal conditions, including rough head normals and emotion prompt embeddings, to enhance consistency and fidelity. As a result, the framework surpasses state-of-the-art methods in generated appearance, 3D structure, expression, and temporal consistency. Overall, by effectively integrating temporal dynamics and multimodal conditions, ConsistentAvatar advances multimedia processing toward lifelike talking avatars with improved consistency and fidelity.
Supplementary Material: zip
Submission Number: 1261