Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main
Submission Type: Regular Short Paper
Submission Track: Speech and Multimodality
Submission Track 2: NLP Applications
Keywords: Multimodal Learning, Parameter-Efficient Adaptation, Speech Recognition, Generative Error Correction
TL;DR: We propose a new cross-modal fusion method integrating large-scale pre-trained speech (Whisper) and language (LLaMA) models for Generative Error Correction.
Abstract: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction over n-best hypotheses. Unlike existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Evaluating our fusion technique across diverse ASR datasets, we demonstrate a 37.66\% relative improvement in word error rate (WER) compared to the n-best Oracle. To encourage future research, we have made our code and pre-trained models open source at [https://github.com/Srijith-rkr/Whispering-LLaMA](https://github.com/Srijith-rkr/Whispering-LLaMA)
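The abstract mentions fusing a frozen speech encoder and a frozen language model through parameter-efficient adaters with careful initialization. As a rough illustration of that general idea (the names, dimensions, and fusion rule below are hypothetical, not the paper's actual architecture), a cross-modal bottleneck adapter might condition the LM's hidden states on a pooled acoustic vector, with a zero-initialized up-projection so the frozen LM's behavior is unchanged at the start of training:

```python
import numpy as np

# Hypothetical sketch: a cross-modal bottleneck adapter. Only the three
# small adapter matrices would be trained; the speech encoder and LM
# producing h_audio and h_text stay frozen.

rng = np.random.default_rng(0)
d_text, d_audio, d_bottleneck = 8, 6, 4

# Frozen LM hidden states for a 5-token hypothesis (stand-in values).
h_text = rng.normal(size=(5, d_text))
# Pooled acoustic representation from the frozen speech encoder.
h_audio = rng.normal(size=(d_audio,))

# Trainable adapter parameters (parameter-efficient: only these update).
W_down = rng.normal(size=(d_text, d_bottleneck)) * 0.1
W_audio = rng.normal(size=(d_audio, d_bottleneck)) * 0.1
W_up = np.zeros((d_bottleneck, d_text))  # zero-init: adapter starts as a no-op

def adapter(h, a):
    # Bottleneck conditioned on the acoustic vector; the residual
    # connection preserves the frozen LM's output at initialization.
    z = np.tanh(h @ W_down + a @ W_audio)
    return h + z @ W_up

out = adapter(h_text, h_audio)
print(np.allclose(out, h_text))  # True: zero-init makes it an identity map
```

The zero-initialized `W_up` reflects the abstract's emphasis on initialization: the adapted model initially reproduces the pre-trained LM exactly, and the acoustic conditioning is learned gradually.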
Submission Number: 4032