Keywords: Talking Head, Video Generation, Diffusion Models
Abstract: Advances in video diffusion models have unlocked the potential for realistic audio-driven talking video generation. However, it remains highly challenging to ensure seamless audio-lip synchronization, maintain long-term identity consistency, and achieve natural expressions aligned with the audio in generated talking videos. To address these challenges, we propose **M**emory-guided **EMO**tion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach that generates identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by maintaining memory states that store information from all previously generated frames and guide temporal modeling through linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from the audio to refine facial expressions via emotion-adaptive layer norm. Moreover, MEMO is trained on a large-scale, high-quality dataset of talking head videos without relying on facial inductive biases such as face landmarks or bounding boxes. Extensive experiments demonstrate that MEMO generates more realistic talking videos across a wide range of audio types, surpassing state-of-the-art talking video diffusion methods in human evaluations of emotion-audio alignment, identity consistency, and overall quality.
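The two modules described above can be pictured with a minimal, hypothetical sketch: the class names, the choice of `elu + 1` feature map for linear attention, and the scale/shift parameterization are our assumptions for illustration, not the authors' released implementation. It shows (a) a memory state that accumulates key/value statistics over previously generated frames and guides linear attention for the current frame, and (b) a layer norm whose affine parameters are predicted from an emotion embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryGuidedLinearAttention(nn.Module):
    """Linear attention whose key/value statistics accumulate across frames (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x, memory_kv=None, memory_k=None):
        # x: (batch, tokens, dim) features of the current frame.
        q = F.elu(self.to_q(x)) + 1.0  # positive feature map (assumed choice)
        k = F.elu(self.to_k(x)) + 1.0
        v = self.to_v(x)

        if memory_kv is None:  # first frame: start from an empty memory state
            memory_kv = torch.zeros(x.size(0), x.size(-1), x.size(-1), device=x.device)
            memory_k = torch.zeros(x.size(0), x.size(-1), device=x.device)

        # Fold the current frame's keys/values into the running memory state.
        memory_kv = memory_kv + torch.einsum("btd,bte->bde", k, v)
        memory_k = memory_k + k.sum(dim=1)

        # Queries attend to the accumulated summary of all frames seen so far.
        num = torch.einsum("btd,bde->bte", q, memory_kv)
        den = torch.einsum("btd,bd->bt", q, memory_k).clamp(min=1e-6).unsqueeze(-1)
        return num / den, memory_kv, memory_k


class EmotionAdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from an emotion embedding (illustrative sketch)."""

    def __init__(self, dim: int, emotion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * dim)

    def forward(self, x, emotion_emb):
        # emotion_emb: (batch, emotion_dim), e.g. from an audio emotion detector.
        scale, shift = self.to_scale_shift(emotion_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

In this reading, the memory state gives each new frame access to a constant-size summary of all earlier frames (supporting long-term identity consistency), while the emotion embedding modulates per-layer statistics so expressions can track the detected audio emotion.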
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2134