Accent-Aware Text-to-Speech for Nigerian English: Building Inclusive Voice AI from Community-Curated Data

Christianah Titilope Oyewale; Benjamin Ogbonna

Published: 22 Sept 2025, Last Modified: 22 Sept 2025 | WiML @ NeurIPS 2025 | CC BY 4.0

Keywords: Nigerian English, Text-to-Speech, StyleTTS2, Accent-Aware TTS, Speech Synthesis, Whisper, Phoneme Embeddings, Diffusion Models, Inclusive AI

Abstract: Voice technologies often fail to represent the linguistic diversity of emerging markets, particularly African accents and languages. This work presents a multilingual text-to-speech (TTS) system tailored for Nigerian-accented English, built on the StyleTTS2 architecture and trained on a community-curated dataset spanning the three major Nigerian ethnic groups: Yoruba, Igbo, and Hausa. To construct the dataset, volunteers with technical backgrounds recorded readings from Nigerian-published texts across domains such as religion, politics, history, and education. Recordings ranged from 1 to 6 hours per speaker. The audio was transcribed with Whisper into timestamped SRT files, which a four-person team then corrected manually. A custom script segmented the audio into variable-length clips (2–30 seconds), yielding over 4,000 paired audio-text samples. The dataset was split 80/20 for training and evaluation. During preprocessing, transcriptions were converted into phonemes using the phonemizer Python package. The model learns the relationship between phoneme sequences and speaker-specific acoustic features, including pitch and prosody. At inference time, given a new text input and reference audio, the model mimics the speaker’s vocal style by predicting pitch contours and generating expressive speech that reflects the speaker’s accent and emotional tone. The architecture integrates phoneme-level BERT embeddings, style and prosody encoders, and a diffusion-based decoder. In an informal evaluation, three human raters assessed each sample. Approximately 87% of synthesized outputs were rated as accent-faithful and emotionally consistent with the reference audio. Intelligibility was assessed with Word Error Rate (WER), which averaged 12.4% across the test set.
This work demonstrates the feasibility of building inclusive voice AI using modest resources and community participation, with potential applications in education, public services, and digital accessibility across Africa.
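
For readers unfamiliar with the intelligibility metric, the following is a minimal sketch of how a WER figure like the one reported above could be computed. It uses the standard definition (word-level edit distance divided by the number of reference words). The abstract does not state which recognizer produced the hypothesis transcripts, how text was normalized, or whether the average is per-sample or pooled, so the lowercasing, whitespace tokenization, per-sample averaging, and the function names `word_error_rate` and `mean_wer` are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(pairs) -> float:
    """Average of per-sample WER over (reference, hypothesis) pairs.

    Assumption: the reported figure is a per-sample mean rather than a
    pooled error count over all reference words.
    """
    scores = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    if not scores:
        raise ValueError("pairs must not be empty")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical example sentences, not drawn from the dataset.
    samples = [
        ("the market opens at nine", "the market open at nine"),
        ("good morning everyone", "good morning everyone"),
    ]
    print(f"mean WER: {mean_wer(samples):.3f}")
```

In practice the hypothesis strings would come from running a speech recognizer on the synthesized audio, and punctuation stripping or number normalization would typically be applied to both sides before scoring.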

Submission Number: 296
