VITS-Based Data Augmentation for Improved ASR Performance and Domain Adaptation

ACL ARR 2024 June Submission 1997 Authors

15 Jun 2024 (modified: 17 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Although end-to-end speech recognition has advanced significantly, it remains challenging in low-resource scenarios, even with traditional data augmentation methods. Recent progress, exemplified by the success of VITS and its variants, has spurred interest in Text-to-Speech (TTS) synthesis as a data augmentation strategy for addressing these difficulties. In this study, we investigate the effectiveness of integrating synthetic speech generated by VITS into the training sets of ASR systems. Through comprehensive experiments, we assess the impact of this approach on the generalization and performance of ASR models in English, Mandarin, and Japanese. Experimental results indicate that, before model transfer, the average character-level accuracy of the VITS-based data augmentation method matches the best performance among traditional data augmentation methods. After model transfer, it significantly outperforms all traditional methods, surpassing Speed Perturbation, the best-performing traditional method, by 3.5%, as well as Tacotron2 and FastSpeech. Our findings indicate that models trained with VITS-based data augmentation are more resilient to domain shift and adapt better across varied linguistic contexts, highlighting the potential of VITS as a valuable data augmentation technique.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English, Chinese, Japanese
Submission Number: 1997
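
As a concrete illustration of the pipeline the abstract describes, the sketch below synthesizes speech from transcripts with a pretrained VITS model and records (audio, text) pairs in a manifest that can be mixed into an ASR training set. It is a minimal sketch, not the authors' implementation: the choice of the Coqui TTS library, the `tts_models/en/ljspeech/vits` checkpoint, the example transcripts, and the tab-separated manifest format are all assumptions introduced for illustration.

```python
from pathlib import Path

from TTS.api import TTS  # Coqui TTS, which ships pretrained VITS checkpoints

# Load a pretrained English VITS model from the Coqui model zoo.
# (The paper does not specify its checkpoints; this model name is an assumption.)
tts = TTS(model_name="tts_models/en/ljspeech/vits")

out_dir = Path("synthetic_audio")
out_dir.mkdir(exist_ok=True)

# Hypothetical transcripts standing in for the text side of an ASR corpus.
transcripts = [
    ("utt0001", "the quick brown fox jumps over the lazy dog"),
    ("utt0002", "speech synthesis can augment low resource training data"),
]

# Synthesize one waveform per transcript and write (audio path, text) pairs
# to a simple manifest so the synthetic data can be merged with real data.
with open("synthetic_manifest.tsv", "w", encoding="utf-8") as manifest:
    for utt_id, text in transcripts:
        wav_path = out_dir / f"{utt_id}.wav"
        tts.tts_to_file(text=text, file_path=str(wav_path))
        manifest.write(f"{wav_path}\t{text}\n")
```

In a training run, this synthetic manifest would simply be concatenated with the real-speech manifest, optionally controlling the ratio of synthetic to real utterances before fine-tuning or transferring the ASR model.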