MoDA: Multi-modal Diffusion Architecture for Talking Head Generation

13 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Talking Head Generation, Diffusion model, Data-Driven Animation
Abstract: Talking head generation from an arbitrary identity and speech audio remains a crucial problem in the virtual metaverse. Despite recent progress, current methods still struggle to synthesize diverse facial expressions and natural head movements while keeping lip motion synchronized with the audio. The main challenge lies in the stylistic discrepancies among speech audio, individual identity, and portrait dynamics. To address this inter-modal inconsistency, we introduce MoDA, a multi-modal diffusion architecture built on two carefully designed components. First, MoDA explicitly models the interaction among motion, audio, and auxiliary conditions, enhancing overall facial expressions and head dynamics. Second, a coarse-to-fine fusion strategy progressively integrates the different conditions, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4850
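
The abstract describes explicit interaction among motion, audio, and auxiliary conditions combined with a coarse-to-fine fusion strategy. The sketch below is a minimal, illustrative interpretation of such a fusion block, not the authors' implementation: the module names (`CoarseToFineFusionBlock`), tensor shapes, and the two-stage attention order (coarse attention over all conditions, then a fine-grained pass over audio) are assumptions made purely for illustration.

```python
# Hypothetical sketch of coarse-to-fine multi-modal fusion in a diffusion-style
# block. Shapes, module names, and fusion order are assumptions, not MoDA's code.
import torch
import torch.nn as nn


class CoarseToFineFusionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Coarse stage: motion latents attend over all conditions jointly.
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fine stage: a dedicated pass over the audio stream (e.g., for lip sync).
        self.fine_audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, motion, audio, aux):
        # motion: (B, T, D) noisy motion latents
        # audio:  (B, T_a, D) audio features; aux: (B, N, D) auxiliary condition tokens
        coarse_ctx = torch.cat([audio, aux], dim=1)
        h, _ = self.coarse_attn(self.norm1(motion), coarse_ctx, coarse_ctx)
        motion = motion + h                      # coarse fusion of all conditions
        h, _ = self.fine_audio_attn(self.norm2(motion), audio, audio)
        motion = motion + h                      # fine-grained audio refinement
        return motion + self.mlp(self.norm3(motion))


if __name__ == "__main__":
    block = CoarseToFineFusionBlock()
    motion = torch.randn(2, 50, 256)   # 50 motion frames
    audio = torch.randn(2, 100, 256)   # audio feature sequence
    aux = torch.randn(2, 1, 256)       # e.g., an identity/style token
    print(block(motion, audio, aux).shape)  # torch.Size([2, 50, 256])
```

The residual structure and per-stage layer norms follow common transformer practice; in a full model, a stack of such blocks would denoise the motion latents before a renderer turns them into video frames.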