Abstract: India is a country with thousands of languages and dialects spoken across a billion-strong population. Text-to-speech systems will play a crucial role in multi-lingual content creation and accessibility. However, current neural TTS systems are data-hungry and need about 20 hours of clean single-speaker speech data for each language and speaker, which does not scale to the large number of Indian languages and dialects. In this work, we demonstrate three simple yet effective pre-training strategies that allow us to train neural TTS systems with about one-tenth of that data while also achieving better accuracy and naturalness. We show that such pre-trained neural TTS systems can be quickly adapted to different speakers across languages and genders with less than 2 hours of data, significantly reducing the effort required for future expansion to the thousands of rare Indian languages. We specifically highlight the benefits of multi-lingual pre-training and its consistent impact across our neural TTS systems for 8 Indian languages.