Adapting Speech Foundation Models for L2 Speakers: Targeted Error Analysis and Adaptation under Proficiency Imbalance

Published: 26 Aug 2025, Last Modified: 26 Aug 2025, SpeechAI TTIC 2025 (Oral or Poster), CC BY 4.0
Keywords: Second Language Speech Recognition, Speech Disfluencies, Speech Data Augmentation, Speech-Language Integrated Modeling
TL;DR: We propose a proficiency-driven adaptation framework for Whisper that improves ASR on low-proficiency L2 speech via targeted error analysis, disfluency augmentation, and speech-aware LLM-based reranking.
Presentation Preference: Yes
Abstract: Speech foundation models like Whisper have set new benchmarks on a wide range of ASR tasks, but they often underperform for second-language (L2) learners due to accent variation, disfluencies, and mispronunciations, speech characteristics that are underrepresented in current model pretraining. In this study, we examine how to adapt large speech foundation models, specifically Whisper, to better serve L2 speakers. Our framework begins with fine-grained error analysis across speaker proficiency levels, which identifies systematic failure modes such as hesitation insertions and elevated deletion, insertion, and substitution rates in low-proficiency groups. This motivates adaptation strategies that explicitly account for proficiency-driven variation in L2 speech. Based on these insights, we implement: (1) parameter-efficient multitask learning via LoRA to jointly model transcription and speaker proficiency, and (2) targeted data augmentation simulating disfluency patterns to mitigate recognition bias toward fluent speech. Preliminary results show that our proficiency-aware multitask model reduces WER across all proficiency levels, with the largest absolute improvement of 4.7% observed in the low-proficiency group. Building on this framework, we plan to explore several extensions to further enhance adaptation for low-proficiency L2 speech, including prompt-based decoding with speech-aware LLMs and N-best hypothesis reranking using both phoneme- and word-level representations. We will also investigate dynamic thresholding mechanisms to better handle hesitation phenomena during decoding. These directions aim to expand the adaptability and interpretability of our pipeline and to provide deeper insights into modeling underrepresented L2 speaker populations.
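The transcript side of the disfluency augmentation described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the filler inventory, insertion/repetition probabilities, and function name are all assumptions made for the example.

```python
import random

# Hypothetical hesitation inventory; a real pipeline would match the
# fillers actually observed in low-proficiency L2 speech.
FILLERS = ["uh", "um", "er"]

def augment_disfluencies(transcript: str, p_filler: float = 0.15,
                         p_repeat: float = 0.10, seed: int = 0) -> str:
    """Simulate L2 disfluency patterns in a training transcript by
    inserting hesitation fillers and word repetitions."""
    rng = random.Random(seed)  # fixed seed for reproducible augmentation
    out = []
    for word in transcript.split():
        if rng.random() < p_filler:
            out.append(rng.choice(FILLERS))  # hesitation before the word
        out.append(word)
        if rng.random() < p_repeat:
            out.append(word)  # simple word repetition
    return " ".join(out)
```

In a full pipeline, matching edits would also be applied on the audio side (e.g. splicing in recorded filler segments), so that the augmented audio-transcript pairs stay aligned; the sketch covers only the text labels.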
Submission Number: 25