Training Speech Recognition Models to Follow Instructions

Cheng-I Lai; Zhiyun Lu; Liangliang Cao; Ruoming Pang

Published: 28 Oct 2023, Last Modified: 26 Nov 2023, Instruction Workshop @ NeurIPS 2023

Keywords: Large Language Model, Speech Recognition, Speech Foundation Model, Instruction-Following

TL;DR: An Automatic Speech Recognition (ASR) model can follow instructions even without the help of any pretrained LLM.

Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. In this paper, we train a speech recognition model to follow a diverse set of free-form text instructions for a multitude of speech recognition tasks -- ranging from simple transcript manipulation to summarization. We emphasize that even without pre-trained LLMs or speech modules, a Listen-Attend-Spell model trained from scratch on Librispeech understands and executes instructions with high fidelity. These preliminary findings highlight the potential of instruction-following training to advance speech foundation models.

Submission Number: 34
