DisFlowEm : One-Shot Emotional Talking Head Generation Using Disentangled Pose and Expression Flow-Guidance

Published: 01 Jan 2025, Last Modified: 28 Apr 2025 · WACV 2025 · CC BY-SA 4.0
Abstract: Generating realistic one-shot emotional talking head animation for arbitrary faces is a challenging problem, as it requires realistic emotions, head movements, identity preservation, and accurate lip sync. Existing emotional talking face generation methods either fail to retain the identity of arbitrary subjects, owing to the limited variability of existing emotional datasets, or fail to capture emotions accurately even when they preserve the identity of arbitrary faces. Moreover, most methods rely on additional input videos to drive poses and/or expressions in the generated video. For practical applications, it is infeasible to obtain driving videos of the same or a different subject with variations in head pose, expressions, etc. In this paper, we propose a novel approach for audio-driven emotional talking head generation from a single image, with emotion-controllable head pose generation. Unlike existing methods, our method requires no driving video for either pose or emotion, and can generate different emotions and diverse head pose variations from input speech and a single image of an arbitrary subject in a neutral expression. Our method overcomes the limitations of existing emotional audio-visual datasets by learning a disentangled optical-flow computation for pose and expression. By independently computing pose-driven and expression-driven optical flow, our image generation network can be pretrained on a large dataset with greater pose variability but lacking emotion annotations. The expression flow generation branch is then fine-tuned on a smaller emotional dataset to accurately capture emotions not present in the pretraining dataset, while retaining its pose variability. We present extensive experiments to demonstrate the superiority of our proposed method in generating talking head animation with accurate emotions, diverse head movements, and generalization to arbitrary faces.
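The abstract does not give implementation details, but the core idea of computing pose-driven and expression-driven flow in separate branches, composing them, and then fine-tuning only the expression branch, can be illustrated in code. The following is a minimal PyTorch sketch under assumed details: the class names, the additive flow composition, and the conditioning codes (`pose_code`, `expr_code`) are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlowBranch(nn.Module):
    """Tiny encoder-decoder that predicts a dense 2-channel (dx, dy) flow field."""

    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class DisentangledFlowGenerator(nn.Module):
    """Predicts pose flow and expression flow independently, then composes them.

    The pose branch conditions on a pose code (e.g., head rotation/translation);
    the expression branch conditions on an audio/emotion code. Both codes are
    broadcast over the spatial grid and concatenated with the source image.
    Additive composition of the two flows is one simple choice, assumed here.
    """

    def __init__(self, img_ch=3, pose_dim=6, expr_dim=16):
        super().__init__()
        self.pose_branch = FlowBranch(img_ch + pose_dim)
        self.expr_branch = FlowBranch(img_ch + expr_dim)

    @staticmethod
    def _tile(code, h, w):
        # Broadcast a (B, D) code to a (B, D, H, W) spatial map.
        return code[:, :, None, None].expand(-1, -1, h, w)

    def forward(self, src_img, pose_code, expr_code):
        b, _, h, w = src_img.shape
        pose_flow = self.pose_branch(
            torch.cat([src_img, self._tile(pose_code, h, w)], dim=1))
        expr_flow = self.expr_branch(
            torch.cat([src_img, self._tile(expr_code, h, w)], dim=1))
        total_flow = pose_flow + expr_flow
        return self.warp(src_img, total_flow), pose_flow, expr_flow

    @staticmethod
    def warp(img, flow):
        # Backward-warp img with the pixel-space flow field via grid_sample.
        b, _, h, w = img.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=img.device),
            torch.linspace(-1, 1, w, device=img.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Normalise pixel displacements to the [-1, 1] sampling grid.
        offset = torch.stack(
            [flow[:, 0] / ((w - 1) / 2), flow[:, 1] / ((h - 1) / 2)], dim=-1)
        return F.grid_sample(img, base + offset, align_corners=True)
```

Under this sketch, the two-stage training the abstract describes would amount to pretraining the whole model on the large pose-diverse dataset, then freezing the pose branch and fine-tuning only the expression branch on the emotional dataset:

```python
model = DisentangledFlowGenerator()
for p in model.pose_branch.parameters():
    p.requires_grad = False  # keep pose variability learned during pretraining
# ...fine-tune model.expr_branch on the smaller emotional dataset
```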