Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation

Published: 01 Jan 2024 · Last Modified: 04 Mar 2025 · ISCSLP 2024 · CC BY-SA 4.0
Abstract: Generating novel voices in speech synthesis is a challenging task with the potential to create the versatile voices needed in entertainment and research. One of the primary obstacles in this area is the lack of well-annotated voice descriptions for expressive speech corpora. Our research addresses this issue by representing speaker styles through vision. We introduce Stable Diffusion-Enhanced Voice Generation (SD-EVG), which leverages Stable Diffusion to generate imaginary facial images for new voice generation. To build a reference set of facial images grounded in real voices, SD-EVG employs a transformer encoder and a Stable Diffusion decoder to visualize the speaker's face. SD-EVG then uses a KNN-based approach to map facial features to speech style for voice generation. Experiments demonstrate that voices generated from the imagined facial data capture speech style better than text-based methods given the same descriptions. A demo website featuring the generated faces and speech utterances is available at https://sd-evg.github.io.
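
The abstract describes a two-stage pipeline: a transformer encoder paired with a Stable Diffusion decoder visualizes a speaker's face, and a KNN-based mapping converts facial features into a speech-style embedding. The sketch below illustrates one plausible reading of the second stage only; the names (`face_bank`, `style_bank`, `face_to_style`), the choice of cosine distance, K = 5, and the distance-weighted averaging are all illustrative assumptions, since the abstract states only that the mapping is KNN-based.

```python
# Minimal sketch of a KNN face-to-style mapping, assuming precomputed
# face embeddings and speaker-style embeddings from a reference corpus.
# File names, dimensions, and the weighting scheme are illustrative,
# not the authors' released implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Reference set: one face embedding and one style embedding per utterance.
face_bank = np.load("face_embeddings.npy")    # (N, d_face), assumed given
style_bank = np.load("style_embeddings.npy")  # (N, d_style), assumed given

knn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(face_bank)

def face_to_style(face_emb: np.ndarray) -> np.ndarray:
    """Map a face embedding (e.g. from an SD-generated image) to a
    speech-style embedding via its nearest neighbors in the reference set."""
    dist, idx = knn.kneighbors(face_emb[None, :])
    # Distance-weighted average of neighbor styles; a plain mean over the
    # K neighbors would be an equally plausible reading of "KNN-based".
    w = 1.0 / (dist[0] + 1e-8)
    w /= w.sum()
    return (w[:, None] * style_bank[idx[0]]).sum(axis=0)
```

The resulting style embedding would then condition the speech synthesizer; how that conditioning is wired into the TTS model is not specified in the abstract.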