From Child Adaptation to Adult Retention: Merging Specialized Arabic–English ASR Models Across Architectures
Keywords: Automatic speech recognition, children’s speech, model merging, catastrophic forgetting, adult–child ASR adaptation
Abstract: Automatic speech recognition (ASR) systems exhibit persistent performance disparities across age groups and speaker nativity, with children’s speech remaining a systematically underrepresented and challenging domain, even in high-resource languages. Existing adaptation strategies predominantly rely on fine-tuning, which often induces catastrophic forgetting and degrades performance on adult speech; these limitations are further amplified in bilingual children’s ASR, where robust cross-language generalization is required. In this work, we explore weight-space model merging as a principled framework for age-robust and language-inclusive speech modeling.
Starting from a shared multilingual base model, we fine-tune complementary child-adapted checkpoints and merge them using balanced weighting to preserve adult representations while incorporating age- and language-specific adaptations. Across all benchmarks, model merging consistently improves recognition accuracy for children while retaining, or in some cases improving, adult performance, outperforming fine-tuning and joint training baselines.
These results demonstrate that model merging provides a scalable and data-efficient alternative to fine-tuning for inclusive ASR across age groups, speaker nativity (L1 and L2 Arabic speakers), and languages (Arabic and English).
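The balanced weight-space merging described in the abstract amounts to averaging parameter tensors from checkpoints that share a common base architecture. The sketch below is a minimal illustration under that assumption; the checkpoint filenames and the PyTorch-style state-dict handling are hypothetical and do not correspond to the authors’ released code.

```python
# Minimal sketch of balanced weight-space merging for fine-tuned checkpoints
# that share the same base architecture. Paths and weights are illustrative.
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of parameter tensors; balanced (uniform) by default.

    Assumes all checkpoints have identical keys and floating-point parameters.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical child-adapted checkpoints fine-tuned from one multilingual base.
paths = ("child_arabic_ft.pt", "child_english_ft.pt")
ckpts = [torch.load(p, map_location="cpu") for p in paths]
torch.save(merge_state_dicts(ckpts), "merged_child_adult.pt")
```

Because the merge operates purely in weight space, it requires no additional training data and can be applied post hoc to any set of checkpoints derived from the same base model.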
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Automatic speech recognition, children’s speech, model merging, catastrophic forgetting, multilingual adaptation
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Arabic, English
Submission Number: 10615