An Annotation System for Controllable Speech Synthesis in Wolof

ACL ARR 2025 May Submission7908 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Recent advances in deep learning have enabled expressive and controllable speech synthesis models. However, building such models requires collecting and annotating large amounts of data, which limits their applicability to low-resource languages. In this paper, we propose an automatic annotation pipeline that bypasses the tedious manual annotation of parameters such as prosody or emotion in a text-to-speech dataset. Our system rebalances the distribution of speech features in the dataset and then uses a large language model, Gemma 2, to predict relevant annotations in the form of textual descriptions, with zero minutes of expert annotation. As most of the extracted features are language-agnostic, we obtain a generic annotation procedure, which we evaluate by finetuning a controllable text-to-speech model on a low-resource language, Wolof. The results show that our model acquires a greater ability to control prosody, with a gain in pitch correlation of +0.09 and a speaker similarity of 0.54. The chosen architecture also performs well on Wolof, with a perceptual quality of 3.34 and a word error rate of 0.45.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: natural language processing, speech synthesis, low resource, Wolof
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Wolof
Submission Number: 7908