Text-Only Unsupervised Domain Adaptation for Neural Transducer-Based ASR Personalization Using Synthesized Data

Dong-Hyun Kim; Jae-Hong Lee; Joon-Hyuk Chang

Published: 01 Jan 2024, Last Modified: 29 Sept 2024, Venue: ICASSP 2024, License: CC BY-SA 4.0

Abstract: Research on personalizing neural transducer-based automatic speech recognition (ASR) systems using text-only data is currently flourishing. Among various approaches, utilizing synthesized speech offers the advantage of adapting the entire ASR system. In this study, we explore the problem of personalization from a domain adaptation perspective and highlight the potential risk of overfitting associated with synthesized speech. To mitigate this risk, we propose a text-only unsupervised domain adaptation (ToUDA) strategy that robustly finetunes the generic ASR model on synthesized speech by incorporating parameter averaging over time, model freezing, and filtering of out-of-distribution instances. Through various experiments, we not only showcase the effectiveness of our approach but also uncover a noteworthy limitation when it comes to personalizing atypical speech.
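
The three ingredients named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which averaging scheme, which frozen components, or which out-of-distribution criterion ToUDA uses. The exponential moving average, the `encoder.` prefix, the per-utterance score, and all function names below are assumptions chosen only to make the ideas concrete, with parameters reduced to plain floats.

```python
from typing import Dict, List, Tuple

Params = Dict[str, float]  # toy stand-in for model weights (assumption)


def ema_update(avg: Params, current: Params, decay: float = 0.9) -> Params:
    """Parameter averaging over time, here as an exponential moving average.

    The averaging scheme is an assumption; the abstract only says
    "parameter-averaging over time".
    """
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}


def sgd_step(params: Params, grads: Params, lr: float,
             frozen_prefixes: Tuple[str, ...]) -> Params:
    """Gradient step that leaves parameters with a frozen prefix untouched."""
    return {
        k: v if k.startswith(frozen_prefixes) else v - lr * grads[k]
        for k, v in params.items()
    }


def filter_in_distribution(utterances: List[Tuple[str, float]],
                           max_score: float) -> List[str]:
    """Keep utterances whose hypothetical OOD score does not exceed max_score."""
    return [text for text, score in utterances if score <= max_score]


# Toy run: freeze the (hypothetical) encoder, update the decoder, then average.
params = {"encoder.w": 1.0, "decoder.w": 1.0}
grads = {"encoder.w": 0.5, "decoder.w": 0.5}
updated = sgd_step(params, grads, lr=0.1, frozen_prefixes=("encoder.",))
averaged = ema_update(params, updated, decay=0.9)

# Toy filter: drop synthesized utterances with a high (made-up) OOD score.
kept = filter_in_distribution([("call mom", 0.1), ("zxqv", 0.9)], max_score=0.5)
```

In this sketch the averaged weights move more slowly than the finetuned ones, which is one common way to limit overfitting to a narrow training set such as synthesized speech; whether ToUDA uses this exact form is not stated in the abstract.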
