Abstract: We propose a unified framework for low-resource automatic speech recognition tasks named meta-audio concatenation (MAC). It is easy to implement and can be carried out in extremely low-resource environments. Mathematically, we give a clear description of the MAC framework from the perspective of Bayesian sampling. We propose a broad notion of meta-audio sets for the concatenative synthesis text-to-speech system to meet the modeling demands of different languages and different scenarios. With a properly chosen meta-audio set, one can integrate language pronunciation rules in a convenient way. It can also help reduce the difficulty of forced alignment, improve the diversity of synthesized audio, and solve the ``out of vocabulary'' (OOV) issue in synthesis. Our experiments demonstrate the effectiveness of MAC on low-resource ASR tasks. On Cantonese, Taiwanese, and Japanese ASR tasks, the MAC method reduces the character error rate (CER) by more than 15% and achieves performance comparable to the fine-tuned wav2vec2 model. In particular, we achieve a 10.9% CER on the Common Voice Cantonese ASR task, about a 30% relative improvement over fine-tuned wav2vec2, which is a new SOTA.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=1wHEALby6u
Changes Since Last Submission: We appreciate the reviewers pointing out two related works. Our method improves significantly over both, and we have now added a discussion of them to the related work section of the article. The two related works are:
C. Du, H. Li, Y. Lu, L. Wang and Y. Qian, "Data Augmentation for end-to-end Code-Switching Speech Recognition," 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 2021, pp. 194-200, doi: 10.1109/SLT48900.2021.9383620. (also https://arxiv.org/pdf/2011.02160.pdf)
R. Zhao, J. Xue, J. Li, W. Wei, L. He and Y. Gong, "On Addressing Practical Challenges for RNN-Transducer," 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 2021, pp. 526-533, doi: 10.1109/ASRU51503.2021.9688101. (also https://www.microsoft.com/en-us/research/uploads/prod/2022/01/ASRU_2021_splicedata.pdf)
Specifically, the audio splicing data augmentation method of Du et al. (2021) replaces only the English portion of code-switching audio, which is a simple and preliminary splicing strategy. The diversity of the spliced audio is therefore limited, and it is difficult to introduce audio containing OOV (out-of-vocabulary) text with this method. Their audio splicing slightly improves the WER from 13.56 to 13.02 on a Mandarin-English code-switching dataset. Zhao et al. (2021) focus on adaptation to new domains, which is outside the scope of low-resource tasks.
In our work, we propose a broad notion of meta-audio sets for the concatenative synthesis text-to-speech system to meet the modeling needs of different languages and different scenarios. With the concatenative synthesis text-to-speech system, we can integrate language pronunciation rules easily. This also helps reduce the difficulty of forced alignment, improves the diversity of synthesized audio, and solves the OOV problem in synthesis. Further, we give a clear mathematical description of the MAC framework from the perspective of Bayesian sampling. We achieve a 10.9% character error rate (CER) on the Common Voice Cantonese ASR task, about a 30% relative improvement over fine-tuned wav2vec2.
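To illustrate the core idea of concatenative synthesis from a meta-audio set, here is a minimal, self-contained sketch. All names (`meta_audio_set`, `synthesize`) and the toy waveforms are illustrative assumptions, not the paper's actual implementation; a real meta-audio set would hold recorded waveform segments for each pronunciation unit.

```python
# Minimal sketch of concatenative synthesis from a hypothetical
# syllable-level "meta-audio set" (names and data are illustrative).
import random

# Each pronunciation unit maps to several recorded variants
# (here, toy lists of float samples standing in for waveforms).
meta_audio_set = {
    "nei5": [[0.1, 0.2, 0.1], [0.05, 0.15, 0.1]],
    "hou2": [[0.3, 0.2], [0.25, 0.2, 0.1]],
}

def synthesize(syllables, rng=random):
    """Concatenate one randomly chosen variant per syllable.

    Sampling a variant for each unit diversifies the synthesized
    audio, and any word whose units appear in the meta-audio set
    can be synthesized, which sidesteps the OOV problem.
    """
    wave = []
    for s in syllables:
        wave.extend(rng.choice(meta_audio_set[s]))
    return wave

audio = synthesize(["nei5", "hou2"])  # spliced waveform for the text
```

The synthesized (text, audio) pairs would then augment the training data for the low-resource ASR model.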
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 886