Abstract: Speech LLMs use speech embeddings as the prompt to a Large Language Model (LLM) and generate human-readable text for the speech signal in an autoregressive manner. Teacher-forcing is a common approach for training Speech LLMs, but it differs from the procedure used during inference, creating a gap between training and inference known as exposure bias. To mitigate exposure bias, we propose Speech-N-LlaMA. Contrary to existing Speech LLMs, which make a single pass through the LLM during training, Speech-N-LlaMA incorporates multi-pass training. Through multiple passes, Speech-N-LlaMA mitigates exposure bias and exploits the error-correction capability of the LLM to improve the performance of Speech LLMs. To achieve this, we propose an N-pass loss and utterance-level temperature sampling. We evaluate four different model sizes on three benchmarks and show up to 18% relative improvement in Word Error Rate (WER) over a baseline Speech LLM, while incurring no additional compute during inference.
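The multi-pass training idea can be illustrated with a minimal sketch. This is not the paper's implementation: the toy decoder, the uniform averaging of per-pass losses, and the temperature range are all assumptions made for illustration. Pass 1 is teacher-forced; each later pass conditions on tokens sampled from the previous pass's output distribution, with a single temperature drawn per utterance (utterance-level temperature sampling), and every pass is scored against the ground-truth transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_tokens(probs, rng):
    # Draw one token id per position from its categorical distribution.
    return np.array([rng.choice(len(p), p=p) for p in probs])

def toy_decoder(prev_tokens, emb, W):
    # Stand-in for the LLM: logits at each position given the previous token.
    return emb[prev_tokens] @ W

def n_pass_loss(targets, emb, W, n_passes=3, temp_range=(0.5, 1.5), rng=rng):
    """Hypothetical N-pass loss sketch: average cross-entropy over N passes,
    where passes > 1 are conditioned on tokens sampled from the prior pass."""
    seq_len = len(targets)
    prev = np.concatenate(([0], targets[:-1]))  # teacher-forced inputs (0 = BOS)
    losses = []
    for _ in range(n_passes):
        logits = toy_decoder(prev, emb, W)
        probs = softmax(logits)
        # Every pass is scored against the ground-truth transcript.
        losses.append(-np.log(probs[np.arange(seq_len), targets] + 1e-9).mean())
        # One temperature per utterance, then resample the next pass's inputs.
        t = rng.uniform(*temp_range)
        sampled = sample_tokens(softmax(logits, t), rng)
        prev = np.concatenate(([0], sampled[:-1]))
    return float(np.mean(losses))

# Toy usage with random parameters (hypothetical sizes).
vocab, dim = 8, 4
emb = rng.normal(size=(vocab, dim))
W = rng.normal(size=(dim, vocab))
targets = rng.integers(0, vocab, size=5)
loss = n_pass_loss(targets, emb, W)
```

At inference, only a single ordinary autoregressive decode is run, which is why the multi-pass scheme adds training cost but no inference-time compute.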