Abstract: In this study, we explore the challenging task of generating facial images from unheard voices, aiming to synthesize faces that resemble the true speaker. We design a novel framework that encompasses self-supervised voice-face representation learning and extends it to voice-based face generation. The key to making cross-modal generation feasible is that we not only enrich voice representations by modeling the locally inherent correlations within voice data but also establish cross-modal connections by aligning voices with paired face data. To further strengthen the voice-face association, we propose a false-negative mitigation method. The learned voice representations are then fed into a diffusion model via cross-attention to produce face images. Experiments show that our framework outperforms previous state-of-the-art methods on various voice-face association evaluation tasks and yields substantially higher-quality images than prior approaches.
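To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of how learned voice representations could condition a diffusion denoiser through cross-attention. This is an illustrative assumption, not the paper's actual implementation: the module name `VoiceCrossAttention`, the token shapes, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class VoiceCrossAttention(nn.Module):
    """Cross-attention block: image latents (queries) attend over voice tokens
    (keys/values), injecting speaker identity into the diffusion denoiser.
    A hypothetical sketch; dimensions and structure are assumptions."""

    def __init__(self, latent_dim: int, voice_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=num_heads,
            kdim=voice_dim, vdim=voice_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents: torch.Tensor, voice_tokens: torch.Tensor) -> torch.Tensor:
        # latents: (B, N, latent_dim) flattened spatial features of the denoiser
        # voice_tokens: (B, T, voice_dim) learned voice representations
        attended, _ = self.attn(query=latents, key=voice_tokens, value=voice_tokens)
        return self.norm(latents + attended)  # residual connection, then norm


# Toy usage: 4 samples, 64 spatial positions, 32 voice tokens (all made-up sizes)
latents = torch.randn(4, 64, 320)
voice = torch.randn(4, 32, 512)
block = VoiceCrossAttention(latent_dim=320, voice_dim=512)
print(block(latents, voice).shape)  # torch.Size([4, 64, 320])
```

In practice, such a block would typically be interleaved with self-attention and convolutional layers at several resolutions of the denoising network, so the voice condition guides generation throughout the reverse diffusion process.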