Effect of Speech Modification on Wav2Vec2 Models for Children Speech Recognition

Published: 01 Jan 2024, Last Modified: 22 May 2025SPCOM 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Speech modification methods normalize children's speech towards adults' speech, enabling off-the-shelf generic automatic speech recognition (ASR) for this low-resource scenario. On the other hand, ASR models like Wav2Vec2 have shown remarkable robustness towards various speakers, thus streamlining their deployment. This paper examines the benefit of speech modification methods when using Wav2Vec2 models on children's speech. We experimented with prototypical speech modification methods and found that while models trained on large datasets exhibit similar performance across unmodified and modified children's speech, models trained on smaller datasets exhibit notably enhanced performance with modified speech. However, analyzing age effects on PF-Star and CMU Kids evaluation sets, we observe that all Wav2Vec2 variants still underperform for children under 10 years. In this scenario, speech modification methods and their combinations help improve performance for small and large Wav2Vec2 models but have plenty of room for improvement.
Loading