From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech


Sep 25, 2019 ICLR 2020 Conference Blind Submission
  • TL;DR: This paper proposes an end-to-end, multi-modal method for generating human faces from speech, built on a self-supervised learning framework.
  • Abstract: This work explores whether a human face can be generated from a voice using only audio-visual data, without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links an inference stage and a generation stage. First, the inference networks are trained to match speaker identity across the two modalities (voice and face). Then the pre-trained inference networks cooperate with the generation network by supplying conditional information about the voice.
  • Keywords: Multi-modal learning, Self-supervised learning, Voice profiling, Conditional GANs
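
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not state the matching loss, the embedding sizes, or how the conditioning is fed to the generator, so everything below is an assumption: the cosine-similarity cross-entropy objective, the toy 3-dimensional embeddings, and the helper names `matching_loss`, `predict_match`, and `conditional_input` are all hypothetical and stand in for trained networks.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def matching_loss(voice_emb, face_embs, positive_index):
    # Assumed stage-1 objective: softmax cross-entropy over the similarities
    # between one voice embedding and several candidate face embeddings.
    scores = [cosine(voice_emb, f) for f in face_embs]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


def predict_match(voice_emb, face_embs):
    # Index of the candidate face most similar to the voice.
    scores = [cosine(voice_emb, f) for f in face_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def conditional_input(voice_emb, noise_dim, rng):
    # Assumed stage-2 conditioning: concatenate the voice embedding with
    # Gaussian noise to form the input of a conditional generator.
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return list(voice_emb) + noise


# Toy embeddings standing in for the outputs of trained inference networks.
voice = [0.9, 0.1, 0.0]
faces = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

best = predict_match(voice, faces)
loss_correct = matching_loss(voice, faces, 0)
loss_wrong = matching_loss(voice, faces, 1)
gen_input = conditional_input(voice, noise_dim=4, rng=random.Random(0))
```

In this toy setup the loss is lower when the labeled positive is the face closest to the voice embedding, which is the behavior a cross-modal identity-matching objective is meant to encourage. No human labels are needed because the positive pair can be taken from the same video clip.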