- TL;DR: This paper proposes an end-to-end, multi-modal method for generating human faces from speech within a self-supervised learning framework.
- Abstract: This work explores whether a human face can be generated from a voice using only audio-visual data, without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links an inference stage and a generation stage. First, the inference networks are trained to match speaker identity across the two modalities. The pre-trained inference networks then guide the generation network by supplying conditional information about the voice.
- Keywords: Multi-modal learning, Self-supervised learning, Voice profiling, Conditional GANs
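
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss or the conditioning mechanism, so everything here is an assumption: `matching_loss` uses a softmax cross-entropy over cosine similarities as one plausible cross-modal matching objective, and `generator_input` concatenates the voice embedding with Gaussian noise, a common way to condition a GAN generator. All function names are hypothetical, and plain Python lists stand in for network outputs.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def matching_loss(voice_embs, face_embs):
    """Stage 1 (assumed objective): voice i should score highest against face i.

    voice_embs[i] and face_embs[i] are embeddings of the same speaker.
    Returns the mean cross-entropy over cosine-similarity scores.
    """
    faces = [normalize(f) for f in face_embs]
    total = 0.0
    for i, voice in enumerate(voice_embs):
        v = normalize(voice)
        scores = [dot(v, f) for f in faces]
        top = max(scores)
        # Numerically stable log-sum-exp over all candidate faces.
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[i]
    return total / len(voice_embs)


def generator_input(voice_emb, noise_dim, rng):
    """Stage 2 (assumed conditioning): voice embedding followed by noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return list(voice_emb) + noise


if __name__ == "__main__":
    voices = [[1.0, 0.0], [0.0, 1.0]]
    matched_faces = [[1.0, 0.0], [0.0, 1.0]]
    swapped_faces = [[0.0, 1.0], [1.0, 0.0]]
    # Correctly paired embeddings should give the lower loss.
    print(matching_loss(voices, matched_faces) < matching_loss(voices, swapped_faces))
    print(len(generator_input(voices[0], noise_dim=3, rng=random.Random(0))))
```

In this toy setup the matching loss is lower when each voice embedding is paired with the face embedding of the same speaker, which is the signal that needs no human-labeled annotations: the pairing comes from the audio-visual data itself.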