TL;DR: We propose DEmoFace, a novel RVQ-based discrete diffusion framework for a new task, Emotional Face-to-Speech (eF2S), making the first attempt to customize a consistent vocal style, including timbre and emotional prosody, solely from the face.
Abstract: How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed *emotional face-to-speech*, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce **DEmoFace**, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance training efficiency and generation quality, we further introduce a coarse-to-fine curriculum learning algorithm for multi-level token processing. In addition, we develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively. Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech than baselines, even surpassing speech-driven methods. Demos of DEmoFace are available at our project page: https://demoface.github.io.
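To make the multi-conditional guidance idea concrete, below is a minimal sketch of how guidance over several conditions (text, facial identity, facial emotion) could be combined in a discrete diffusion step, in the style of classifier-free guidance. The function `denoiser`, the dictionary-based conditioning interface, the per-condition weights, and the null-embedding dropout are all illustrative assumptions, not the paper's actual formulation of predictor-free guidance.

```python
import torch

def multi_conditional_guidance(denoiser, x_t, t, conds, weights, null_cond):
    """Hypothetical multi-conditional guidance over discrete token logits.

    Assumed setup: `denoiser(x_t, t, conditions)` returns per-token logits
    over the RVQ codebook. Guidance combines per-condition terms as
    logits = uncond + sum_i w_i * (cond_i - uncond), analogous to
    classifier-free guidance; this is an illustrative sketch only.
    """
    # Unconditional pass: every condition replaced by its learned null embedding.
    logits_uncond = denoiser(x_t, t, {k: null_cond[k] for k in conds})

    logits = logits_uncond.clone()
    for name, cond in conds.items():
        # One forward pass per condition, with the other conditions dropped
        # to their null embeddings, so each guidance term is isolated.
        single = {k: (cond if k == name else null_cond[k]) for k in conds}
        logits_cond = denoiser(x_t, t, single)
        logits = logits + weights[name] * (logits_cond - logits_uncond)

    # Categorical logits used for the reverse (denoising) diffusion step.
    return logits
```

A per-condition weight (e.g., separate weights for text, identity, and emotion) would let generation trade off content accuracy against vocal-style fidelity; the exact weighting scheme used by DEmoFace is described in the paper itself.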
Lay Summary: How much can we infer about a person's authentic vocal style—including their voice timbre and emotional prosody—just by observing their facial expressions? This intriguing question has wide-ranging applications, from dubbing virtual characters to assisting individuals with expressive language disorders. To that end, we introduce DEmoFace, a novel system that generates emotional speech solely from expressive visual information. Specifically, DEmoFace is built upon an advanced diffusion framework, conditioned on visual inputs for acoustic characteristics and on text inputs for semantic content. We further develop a multi-conditional guidance mechanism to improve fidelity to the multimodal conditions. Extensive experimental results demonstrate that DEmoFace produces speech with greater naturalness and vocal-style consistency than existing approaches, serving as a foundation for multimodal personalized text-to-speech systems.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Language, Speech and Dialog
Keywords: Generative Model, Discrete Diffusion Model, Speech Generation, Text-to-Speech
Submission Number: 311