Abstract: Modern large language models (LLMs) are adept at performing various text generation tasks when prompted with instructions designed for specific objectives. These abilities can enhance the quality of text produced by automatic speech recognition (ASR), enabling the selection of semantically more accurate words. However, relying solely on LLMs to correct errors in ASR predictions may introduce unintended words or modifications that do not accurately reflect the speech input. In this work, we propose a novel ASR model that integrates the text generation capabilities of LLMs while ensuring proper alignment with speech inputs. Specifically, our model is built on the attention-based encoder-decoder (AED) architecture, with the LLM serving as a front-end feature extractor for the decoder. The decoder is trained to predict words from the LLM-derived features, with cross-attention aligning these features with the speech encodings from the encoder. We also design an effective prompting strategy that uses a hypothesized text sequence to extract linguistic information beneficial for ASR. Experimental results demonstrate that our proposed model outperforms conventional AED-based models across major ASR tasks.
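The following is a minimal sketch, not the authors' implementation, of the architecture the abstract describes: an LLM serves as a front-end feature extractor for the decoder of an AED model, and the decoder's cross-attention grounds the LLM-derived features in the speech encodings. The choice of GPT-2 as the LLM, the generic Transformer speech encoder, the freezing of the LLM, and all dimensions and layer counts are assumptions made for illustration.

```python
# Sketch of an AED-style ASR model where an LLM extracts decoder-side
# features from a hypothesized text prompt, and cross-attention aligns
# those features with the speech encodings. Assumptions: GPT-2 as the
# LLM (frozen), a plain Transformer as the speech encoder, d_model=768.
import torch
import torch.nn as nn
from transformers import GPT2Model


class LLMFrontEndAED(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 768, n_layers: int = 6):
        super().__init__()
        # Speech encoder: stand-in for any acoustic encoder (e.g., Conformer).
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.speech_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # LLM used purely as a front-end feature extractor for the decoder.
        self.llm = GPT2Model.from_pretrained("gpt2")  # hidden size 768
        for p in self.llm.parameters():
            p.requires_grad = False  # keeping the LLM frozen is an assumption
        # Decoder: cross-attention aligns LLM features with speech encodings.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats: torch.Tensor, prompt_ids: torch.Tensor):
        # speech_feats: (B, T_speech, d_model) acoustic feature sequence
        # prompt_ids:   (B, T_text) token ids of the hypothesized text prompt
        speech_enc = self.speech_encoder(speech_feats)
        llm_feats = self.llm(input_ids=prompt_ids).last_hidden_state
        # Predict words from LLM-derived features; cross-attention to the
        # speech encodings keeps the predictions tied to the speech input.
        dec_out = self.decoder(tgt=llm_feats, memory=speech_enc)
        return self.out_proj(dec_out)  # (B, T_text, vocab_size) logits


# Hypothetical usage: the prompt would come from a first-pass hypothesis.
model = LLMFrontEndAED(vocab_size=50257)
logits = model(torch.randn(2, 200, 768), torch.randint(0, 50257, (2, 16)))
```

In this sketch the hypothesized text sequence plays the role of the prompt from which the LLM extracts linguistic features; the actual acoustic front-end, prompt construction, and training objective would follow the paper's specifics.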
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition
Languages Studied: English
Submission Number: 1386