Abstract: While sample-based Minimum Bayes Risk (MBR) decoding has been shown to outperform beam search in many text-to-text generation tasks with modern LLMs, beam search remains the dominant approach for Automatic Speech Recognition (ASR) and Speech Translation (ST). To date, the efficacy of MBR decoding in modern speech systems has not been comprehensively evaluated.
Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks.
In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models.
We observe that MBR decoding achieves higher accuracy than beam search in most of the experimental settings we evaluated.
The results show that MBR decoding is a promising method for ASR and ST tasks that require high accuracy.
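A minimal sketch of sample-based MBR decoding is shown below to illustrate the general technique. It rests on stated assumptions: the candidate transcripts are taken to be sampled from a speech model such as Whisper (sampling is not shown), the same samples serve as both hypotheses and pseudo-references, and the utility is a hypothetical choice of one minus word error rate against each pseudo-reference. The helper names (`word_edit_distance`, `utility`, `mbr_decode`) are illustrative. They are not the paper's implementation, and the utility functions and sampling settings used in the paper may differ.

```python
from typing import List


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, wb in enumerate(b, start=1):
            curr[j] = min(
                prev[j] + 1,               # delete wa
                curr[j - 1] + 1,           # insert wb
                prev[j - 1] + (wa != wb),  # substitute (free if words match)
            )
        prev = curr
    return prev[-1]


def utility(hyp: str, pseudo_ref: str) -> float:
    """Hypothetical utility: 1 - WER of hyp measured against a pseudo-reference."""
    h, r = hyp.split(), pseudo_ref.split()
    if not r:
        return 1.0 if not h else 0.0
    return 1.0 - word_edit_distance(h, r) / len(r)


def mbr_decode(samples: List[str]) -> str:
    """Return the sample with the highest expected utility.

    The expectation over the model distribution is approximated by averaging
    the utility against every sample (a Monte Carlo estimate), with the
    samples acting as pseudo-references.
    """
    best, best_score = samples[0], float("-inf")
    for hyp in samples:
        score = sum(utility(hyp, ref) for ref in samples) / len(samples)
        if score > best_score:
            best, best_score = hyp, score
    return best


if __name__ == "__main__":
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a cat sat on the mat",
        "the cat sat on the mat",
        "the hat sat on the mat",
    ]
    print(mbr_decode(samples))
```

Unlike beam search, which returns the single highest-probability sequence it finds, this selection rule favors the hypothesis that agrees most with the other samples. In the usage example, "the cat sat on the mat" is chosen because every other sample is within one word edit of it. The cost is quadratic in the number of samples, since each hypothesis is scored against every pseudo-reference.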
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 6969