Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main
Submission Type: Regular Short Paper
Submission Track: Speech and Multimodality
Submission Track 2: NLP Applications
Keywords: Multimodal Learning, Parameter-Efficient Adaptation, Speech Recognition, Generative Error Correction
TL;DR: We propose a new cross-modal fusion method integrating large-scale pre-trained speech (Whisper) and language (LLaMA) models for Generative Error Correction.
Abstract: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction over n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Evaluating our fusion technique across diverse ASR datasets, we demonstrate a 37.66\% relative improvement in word error rate (WER) compared to the n-best Oracle. To encourage future research, we have made our code and pre-trained models open source at [https://github.com/Srijith-rkr/Whispering-LLaMA](https://github.com/Srijith-rkr/Whispering-LLaMA)
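The abstract mentions fusing a frozen speech encoder and a frozen language model through parameter-efficient adaters with careful initialization. As a rough illustration of that general idea (the names, dimensions, and fusion rule below are hypothetical, not the paper's actual architecture), a cross-modal bottleneck adapter might condition the LM's hidden states on a pooled acoustic vector, with a zero-initialized up-projection so the frozen LM's behavior is unchanged at the start of training:

```python
import numpy as np

# Hypothetical sketch: a cross-modal bottleneck adapter. Only the three
# small adapter matrices would be trained; the speech encoder and LM
# producing h_audio and h_text stay frozen.

rng = np.random.default_rng(0)
d_text, d_audio, d_bottleneck = 8, 6, 4

# Frozen LM hidden states for a 5-token hypothesis (stand-in values).
h_text = rng.normal(size=(5, d_text))
# Pooled acoustic representation from the frozen speech encoder.
h_audio = rng.normal(size=(d_audio,))

# Trainable adapter parameters (parameter-efficient: only these update).
W_down = rng.normal(size=(d_text, d_bottleneck)) * 0.1
W_audio = rng.normal(size=(d_audio, d_bottleneck)) * 0.1
W_up = np.zeros((d_bottleneck, d_text))  # zero-init: adapter starts as a no-op

def adapter(h, a):
    # Bottleneck conditioned on the acoustic vector; the residual
    # connection preserves the frozen LM's output at initialization.
    z = np.tanh(h @ W_down + a @ W_audio)
    return h + z @ W_up

out = adapter(h_text, h_audio)
print(np.allclose(out, h_text))  # True: zero-init makes it an identity map
```

The zero-initialized `W_up` reflects the abstract's emphasis on initialization: the adapted model initially reproduces the pre-trained LM exactly, and the acoustic conditioning is learned gradually.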
Submission Number: 4032