Cross-lingual transfer using phonological features for resource-scarce text-to-speech

Published: 15 Jun 2023, Last Modified: 28 Jun 2023, SSW12, Readers: Everyone
Keywords: text-to-speech, resource-scarce, phonological features, cross-lingual
Abstract: In this work, we explore the use of phonological features in cross-lingual transfer within resource-scarce settings. We modify the architecture of VITS to accept a phonological feature vector as input, instead of phonemes or characters. Subsequently, we train multispeaker base models using data from LibriTTS and then fine-tune them on single-speaker Afrikaans and isiXhosa datasets of varying sizes, representing the resource-scarce setting. We evaluate the synthetic speech both objectively and subjectively and compare it to models trained with the same data using the standard VITS architecture. In our experiments, the proposed system utilizing phonological features as input converges significantly faster and requires less data than the base system. We demonstrate that the model employing phonological features is capable of producing sounds in the target language that were unseen in the source language, even in languages with significant linguistic differences, and with only 5 minutes of data in the target language.
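To illustrate why a phonological feature vector can help with sounds unseen in the source language, here is a minimal sketch. It assumes a toy binary feature set and a toy phoneme table; the names `FEATURES`, `PHONEMES`, `SOURCE_INVENTORY`, `to_vector`, `encode`, and `is_covered` are hypothetical and are not taken from the paper, whose actual feature inventory and VITS modifications are not described in this abstract.

```python
# Toy, simplified feature set (an assumption for illustration only;
# not the feature inventory used in the paper).
FEATURES = ["voiced", "nasal", "labial", "coronal", "dorsal", "continuant"]

# Toy phoneme table mapping each phoneme to its active features.
PHONEMES = {
    "p": {"labial"},
    "b": {"voiced", "labial"},
    "m": {"voiced", "nasal", "labial"},
    "t": {"coronal"},
    "d": {"voiced", "coronal"},
    "n": {"voiced", "nasal", "coronal"},
    "s": {"coronal", "continuant"},
    "z": {"voiced", "coronal", "continuant"},
    "k": {"dorsal"},
    "g": {"voiced", "dorsal"},
    "x": {"dorsal", "continuant"},  # velar fricative, absent from the toy source set
}

# Hypothetical source-language inventory (everything except "x").
SOURCE_INVENTORY = {"p", "b", "m", "t", "d", "n", "s", "z", "k", "g"}


def to_vector(phoneme):
    """Return the binary feature vector for one phoneme."""
    active = PHONEMES[phoneme]
    return [1 if feature in active else 0 for feature in FEATURES]


def encode(phonemes):
    """Encode a phoneme sequence as a list of feature vectors (model input)."""
    return [to_vector(p) for p in phonemes]


def is_covered(target, source_inventory):
    """Check that every active feature of `target` occurs in some source phoneme."""
    seen = set()
    for p in source_inventory:
        seen |= PHONEMES[p]
    return PHONEMES[target] <= seen


if __name__ == "__main__":
    print(encode(["b", "x"]))
    print(is_covered("x", SOURCE_INVENTORY))
```

In this sketch, "x" never appears in the source inventory, yet each of its active features ("dorsal" from "k"/"g", "continuant" from "s"/"z") does. A model that takes such vectors as input has therefore seen every input dimension during base training, whereas a phoneme-ID embedding for "x" would start untrained.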
