Abstract: Evidence for a learnable cross-modal association between a person's face and their voice has been mounting in recent years. This provides the basis for the task of target speaker text-to-speech (TTS) synthesis from a face reference. In this paper, we approach this task by proposing a cross-modal model architecture that combines existing unimodal models. We use Tacotron 2 multi-speaker TTS with auditory speaker embeddings based on Global Style Tokens. We transfer-learn a FaceNet face encoder to predict these embeddings from a static face image instead of a voice reference, thus predicting a speaker's voice and speaking characteristics from their face. Compared to Face2Speech, the only existing work on this task, we use a more modular architecture that allows the use of openly available, pretrained model components. This approach enables high-quality speech synthesis and allows for an easily extensible model architecture. Experimental results show good matching ability while achieving better voice naturalness than Face2Speech. We examine the limitations of our model and discuss multiple possible avenues of improvement for future work.
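To make the described modular architecture concrete, the following is a minimal sketch of the face-to-speaker-embedding branch: a pretrained face encoder (e.g. a FaceNet-style backbone) whose output is projected to a GST-style speaker embedding that conditions the multi-speaker TTS model. All module names, dimensions, and the loss shown in the comments are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class FaceToSpeakerEmbedding(nn.Module):
    """Sketch: map a face encoder's output to a GST-style speaker embedding
    that can replace the audio-derived embedding at synthesis time."""

    def __init__(self, face_encoder: nn.Module, face_dim: int = 512, gst_dim: int = 256):
        super().__init__()
        # Pretrained face backbone (frozen or fine-tuned during transfer learning).
        self.face_encoder = face_encoder
        # Small head trained to regress GST speaker embeddings from face features.
        self.projection = nn.Sequential(
            nn.Linear(face_dim, gst_dim),
            nn.Tanh(),
        )

    def forward(self, face_image: torch.Tensor) -> torch.Tensor:
        face_feat = self.face_encoder(face_image)        # (batch, face_dim)
        speaker_embedding = self.projection(face_feat)   # (batch, gst_dim)
        return speaker_embedding


# Assumed training setup: regress toward GST embeddings extracted from
# reference audio of the same speaker, e.g.
#   loss = nn.functional.mse_loss(model(face_image), gst_from_audio)
# At inference, the predicted embedding conditions the multi-speaker TTS model
# in place of the audio-derived one.
```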