Few-Shot Synthetic-Only Accent Adaptation for ASR via LLM-Guided Phoneme Editing
Keywords: accented ASR, synthetic-only training, few-shot adaptation, LLM-based phoneme editing
TL;DR: We show that synthetic speech generated via few-shot accent-speaker adaptation and LLM-guided phoneme editing can improve accented ASR without using any real accented speech for fine-tuning.
Abstract: Automatic speech recognition (ASR) accuracy often degrades on accented speech due to the limited availability of accented training data. While synthetic speech has been used for augmentation, prior work typically mixes synthetic and real speech, and purely synthetic fine-tuning has shown inconsistent gains. We investigate whether synthetic data alone, generated through accent-aware phoneme editing and few-shot speaker adaptation, can improve accented ASR without any real accented speech. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-specific pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including in cross-speaker evaluation and ultra-low-data regimes.
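To make the phoneme-editing stage concrete, the sketch below uses a simple rule-based substitution as a stand-in for the paper's LLM-guided editor; the accent name, substitution rules, and ARPAbet strings are illustrative assumptions, not details taken from the submission.

```python
# Hypothetical stand-in for LLM-guided phoneme editing: map canonical
# ARPAbet phonemes to accent-specific variants via substitution rules.
# The rules below are illustrative examples, not the paper's actual edits.

ACCENT_RULES = {
    # e.g., dental fricatives realized as stops, /w/-/v/ merger
    "example_accent": [("DH", "D"), ("TH", "T"), ("W", "V")],
}

def edit_phonemes(phonemes, accent):
    """Apply accent-specific substitutions to a phoneme sequence.

    An LLM-based editor would replace this lookup with model-generated
    edits; the downstream TTS + ASR fine-tuning loop stays the same.
    """
    rules = dict(ACCENT_RULES.get(accent, []))
    return [rules.get(p, p) for p in phonemes]

canonical = ["DH", "AH", "W", "ER", "D"]  # "the word"
edited = edit_phonemes(canonical, "example_accent")
print(edited)  # ['D', 'AH', 'V', 'ER', 'D']
```

In the full pipeline, the edited phoneme sequences would drive the few-shot-adapted TTS decoder, and the resulting audio would be used to fine-tune the ASR model.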
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 33