Practical Study of Deep Learning Models for Speech Synthesis

Quentin Langlois, Sébastien Jodogne

Published: 2023, Last Modified: 07 May 2026, Venue: PETRA 2023, License: CC BY-SA 4.0

Abstract: Speech synthesis systems, also known as Text-To-Speech (TTS) systems, are increasingly common, with applications such as voice assistants and screen readers for visually impaired or blind people. These applications require strong real-time capabilities to be usable in practice, which can come at the cost of reduced quality in the synthesized voices. Deep Learning models, which have shown impressive results in audio generation, are hardly ever used for everyday TTS because of their high demand for computational resources. Training such models also requires a large amount of good-quality data, which is not available for most languages. This paper explores the benefits of cross-lingual transfer learning, both in terms of training time and the amount of data needed to obtain good-quality models. Our contributions are evaluated against other TTS systems available for the French language. The main observation is that good-quality single-speaker models can be trained within half a week on a single GPU, with a limited amount of good-quality data, by combining transfer learning with few-shot learning.

External IDs: dblp:conf/petra/LangloisJ23