ASRLM: ASR-Robust Language Model Pre-training via Generative and Discriminative Learning

Published: 01 Jan 2024, Last Modified: 22 Nov 2024, NLPCC (3) 2024, CC BY-SA 4.0
Abstract: The rise of voice interface applications has renewed interest in improving the robustness of spoken language understanding (SLU). Many advances have come from end-to-end speech-language joint training, such as inferring semantics directly from speech signals or post-editing automatic speech recognition (ASR) output. Despite their performance achievements, these methods either depend on large amounts of paired error-prone ASR transcriptions and ground-truth annotations, which are often unavailable, or are computationally costly. To mitigate these issues, we propose an ASR-robust pre-trained language model (ASRLM), which comprises a generator that produces simulated ASR transcriptions from ground-truth annotations and a sample-efficient discriminator that distinguishes plausible ASR errors from unrealistic ones. Experimental results demonstrate that ASRLM improves performance on a wide range of SLU tasks in the presence of ASR errors while saving 27% of the computation cost compared to baselines. Analysis also shows that our proposed generator simulates real-world ASR error patterns better than other simulation methods, including both BERT- and GPT-4-based ones.
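The generator/discriminator setup described in the abstract can be illustrated with a minimal sketch. This is not the paper's method (ASRLM's generator is a learned model); it is a toy simulation assuming a small hand-written confusion table of phonetically plausible substitutions, showing how corrupted transcriptions and token-level replaced/kept labels for a discriminator could be produced.

```python
import random

# Hypothetical toy confusion table (illustrative only): words an ASR
# system might plausibly mis-recognize. The paper's generator learns
# such confusions rather than looking them up.
CONFUSIONS = {
    "their": ["there", "they're"],
    "flight": ["fright", "light"],
    "book": ["look", "buck"],
}

def simulate_asr_errors(tokens, p=0.5, seed=0):
    """Corrupt a ground-truth token sequence into a simulated ASR
    transcription. Returns the corrupted tokens plus binary labels
    (1 = replaced, 0 = kept), which would serve as the discriminator's
    training targets in an ELECTRA-style objective."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if tok in CONFUSIONS and rng.random() < p:
            corrupted.append(rng.choice(CONFUSIONS[tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = "book their flight".split()
corrupted, labels = simulate_asr_errors(tokens, p=1.0)
```

A discriminator would then be trained to recover `labels` from `corrupted`, learning which errors are realistic without requiring paired real ASR transcriptions.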