Abstract: In this study, we explore the challenging task of generating facial images from unheard voices, aiming to synthesize faces that resemble the true speaker. We design a novel framework that encompasses self-supervised voice-face representation learning and extends it to voice-based face generation. The key to making cross-modal generation feasible is that we not only enrich voice representations by modeling the locally inherent correlations within voice data but also establish cross-modal connections by aligning voices with paired face data. To further strengthen the voice-face association, we propose a false-negative mitigation method. The learned voice representations are then fed into a diffusion model via cross-attention to produce face images. Experiments show that our framework outperforms previous state-of-the-art methods on various voice-face association evaluation tasks and yields substantially higher-quality images than prior approaches.
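To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of how learned voice representations could condition a diffusion denoiser through cross-attention. This is an illustrative assumption, not the paper's actual implementation: the module name `VoiceCrossAttention`, the token shapes, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class VoiceCrossAttention(nn.Module):
    """Cross-attention block: image latents (queries) attend over voice tokens
    (keys/values), injecting speaker identity into the diffusion denoiser.
    A hypothetical sketch; dimensions and structure are assumptions."""

    def __init__(self, latent_dim: int, voice_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=num_heads,
            kdim=voice_dim, vdim=voice_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents: torch.Tensor, voice_tokens: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, latent_dim) flattened spatial features of the denoiser
        # voice_tokens: (B, T, voice_dim) learned voice representations
        attended, _ = self.attn(query=latents, key=voice_tokens, value=voice_tokens)
        return self.norm(latents + attended)  # residual connection, then norm


# Toy usage: 4 samples, 64 spatial positions, 32 voice tokens (all made-up sizes)
latents = torch.randn(4, 64, 320)
voice = torch.randn(4, 32, 512)
block = VoiceCrossAttention(latent_dim=320, voice_dim=512)
print(block(latents, voice).shape)  # torch.Size([4, 64, 320])
```

In practice, such a block would typically be interleaved with self-attention and convolutional layers at several resolutions of the denoising network, so the voice condition guides generation throughout the reverse diffusion process.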