Abstract: Modern large language models (LLMs) are adept at performing various text generation tasks when prompted with instructions designed for specific objectives. These abilities can enhance the quality of text produced by automatic speech recognition (ASR), enabling the selection of semantically more accurate words. However, relying solely on LLMs to correct errors in ASR predictions may introduce unintended words or modifications that do not accurately reflect the speech input. In this work, we propose a novel ASR model that integrates the text generation capabilities of LLMs while ensuring proper alignment with speech inputs. Specifically, our model is built on the attention-based encoder-decoder (AED) architecture, with the LLM serving as a front-end feature extractor for the decoder. The decoder is trained to predict words from the LLM-derived features, with cross-attention aligning these features with the speech encodings from the encoder. We also design an effective prompting strategy that uses a hypothesized text sequence to extract linguistic information beneficial for ASR. Experimental results demonstrate that our proposed model outperforms conventional AED-based models across major ASR tasks.
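The following is a minimal sketch, not the authors' implementation, of the architecture the abstract describes: an LLM serves as a front-end feature extractor for the decoder of an AED model, and the decoder's cross-attention grounds the LLM-derived features in the speech encodings. The choice of GPT-2 as the LLM, the generic Transformer speech encoder, the freezing of the LLM, and all dimensions and layer counts are assumptions made for illustration.

```python
# Sketch of an AED-style ASR model where an LLM extracts decoder-side
# features from a hypothesized text prompt, and cross-attention aligns
# those features with the speech encodings. Assumptions: GPT-2 as the
# LLM (frozen), a plain Transformer as the speech encoder, d_model=768.
import torch
import torch.nn as nn
from transformers import GPT2Model


class LLMFrontEndAED(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 768, n_layers: int = 6):
        super().__init__()
        # Speech encoder: stand-in for any acoustic encoder (e.g., Conformer).
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.speech_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # LLM used purely as a front-end feature extractor for the decoder.
        self.llm = GPT2Model.from_pretrained("gpt2")  # hidden size 768
        for p in self.llm.parameters():
            p.requires_grad = False  # keeping the LLM frozen is an assumption
        # Decoder: cross-attention aligns LLM features with speech encodings.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats: torch.Tensor, prompt_ids: torch.Tensor):
        # speech_feats: (B, T_speech, d_model) acoustic feature sequence
        # prompt_ids:   (B, T_text) token ids of the hypothesized text prompt
        speech_enc = self.speech_encoder(speech_feats)
        llm_feats = self.llm(input_ids=prompt_ids).last_hidden_state
        # Predict words from LLM-derived features; cross-attention to the
        # speech encodings keeps the predictions tied to the speech input.
        dec_out = self.decoder(tgt=llm_feats, memory=speech_enc)
        return self.out_proj(dec_out)  # (B, T_text, vocab_size) logits


# Hypothetical usage: the prompt would come from a first-pass hypothesis.
model = LLMFrontEndAED(vocab_size=50257)
logits = model(torch.randn(2, 200, 768), torch.randint(0, 50257, (2, 16)))
```

In this sketch the hypothesized text sequence plays the role of the prompt from which the LLM extracts linguistic features; the actual acoustic front-end, prompt construction, and training objective would follow the paper's specifics.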
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition
Languages Studied: English
Submission Number: 1386