Abstract: Speech LLMs use speech embeddings as the prompt to a Large Language Model (LLM) and generate human-readable text for the speech signal in an autoregressive manner. Teacher-forcing is a common approach for training Speech LLMs, but it differs from the procedure used during inference, creating a gap between training and inference known as exposure bias. To mitigate exposure bias, we propose Speech-N-LlaMA. Contrary to existing Speech LLMs, which make a single pass through the LLM during training, Speech-N-LlaMA incorporates multi-pass training. Through multiple passes, Speech-N-LlaMA mitigates exposure bias and exploits the error-correction capability of the LLM to improve the performance of Speech LLMs. To achieve this, we propose an N-pass loss and utterance-level temperature sampling. We evaluate four different model sizes on three benchmarks and show up to 18% relative improvement in Word Error Rate (WER) over a baseline Speech LLM, while incurring no additional compute during inference.
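The multi-pass training idea can be illustrated with a minimal sketch. This is not the paper's implementation: the toy decoder, the uniform averaging of per-pass losses, and the temperature range are all assumptions made for illustration. Pass 1 is teacher-forced; each later pass conditions on tokens sampled from the previous pass's output distribution, with a single temperature drawn per utterance (utterance-level temperature sampling), and every pass is scored against the ground-truth transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_tokens(probs, rng):
    # Draw one token id per position from its categorical distribution.
    return np.array([rng.choice(len(p), p=p) for p in probs])

def toy_decoder(prev_tokens, emb, W):
    # Stand-in for the LLM: logits at each position given the previous token.
    return emb[prev_tokens] @ W

def n_pass_loss(targets, emb, W, n_passes=3, temp_range=(0.5, 1.5), rng=rng):
    """Hypothetical N-pass loss sketch: average cross-entropy over N passes,
    where passes > 1 are conditioned on tokens sampled from the prior pass."""
    seq_len = len(targets)
    prev = np.concatenate(([0], targets[:-1]))  # teacher-forced inputs (0 = BOS)
    losses = []
    for _ in range(n_passes):
        logits = toy_decoder(prev, emb, W)
        probs = softmax(logits)
        # Every pass is scored against the ground-truth transcript.
        losses.append(-np.log(probs[np.arange(seq_len), targets] + 1e-9).mean())
        # One temperature per utterance, then resample the next pass's inputs.
        t = rng.uniform(*temp_range)
        sampled = sample_tokens(softmax(logits, t), rng)
        prev = np.concatenate(([0], sampled[:-1]))
    return float(np.mean(losses))

# Toy usage with random parameters (hypothetical sizes).
vocab, dim = 8, 4
emb = rng.normal(size=(vocab, dim))
W = rng.normal(size=(dim, vocab))
targets = rng.integers(0, vocab, size=5)
loss = n_pass_loss(targets, emb, W)
```

At inference, only a single ordinary autoregressive decode is run, which is why the multi-pass scheme adds training cost but no inference-time compute.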