Text-Only Unsupervised Domain Adaptation for Neural Transducer-Based ASR Personalization Using Synthesized Data
Abstract: Research on personalizing neural transducer-based automatic speech recognition (ASR) systems using text-only data is currently flourishing. Among the various approaches, using synthesized speech has the advantage of adapting the entire ASR system. In this study, we examine personalization from a domain adaptation perspective and highlight the risk of overfitting to synthesized speech. To mitigate this risk, we propose a text-only unsupervised domain adaptation (ToUDA) strategy that robustly fine-tunes the generic ASR model on synthesized speech by incorporating parameter averaging over time, model freezing, and filtering of out-of-distribution instances. Through various experiments, we not only demonstrate the effectiveness of our approach but also uncover a noteworthy limitation in personalizing atypical speech.
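The three ingredients named in the abstract (filtering out-of-distribution instances, model freezing, and parameter averaging over time) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say how instances are scored, which parts of the model are frozen, or which averaging scheme is used. The per-utterance score, the name-prefix freezing rule, the exponential moving average, and all function and parameter names below are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def filter_in_distribution(utterances: Sequence[str],
                           scores: Sequence[float],
                           threshold: float) -> List[str]:
    """Keep utterances whose score is at least `threshold`.

    Assumption: some per-utterance score exists where higher means
    "more in-distribution"; the abstract does not define one.
    """
    return [u for u, s in zip(utterances, scores) if s >= threshold]


def sgd_step(params: Params, grads: Params, lr: float,
             frozen_prefixes: Sequence[str]) -> Params:
    """One plain SGD step that leaves frozen parameters untouched.

    Assumption: freezing is decided by parameter-name prefix.
    """
    frozen = tuple(frozen_prefixes)
    updated: Params = {}
    for name, values in params.items():
        if name.startswith(frozen):
            updated[name] = list(values)  # frozen: copy unchanged
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


def update_average(avg: Params, params: Params, decay: float) -> Params:
    """Exponential moving average of parameters across training steps.

    Assumption: EMA stands in for "parameter averaging over time".
    """
    return {
        name: [decay * a + (1.0 - decay) * p
               for a, p in zip(avg[name], params[name])]
        for name in params
    }
```

In a fine-tuning loop under these assumptions, the filter would be applied once to the synthesized utterances, `sgd_step` would run per batch, and `update_average` would be called after each step, with the averaged parameters used for evaluation instead of the last iterate.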