Re-evaluating Minimum Bayes Risk Decoding for Automated Speech Recognition Tasks

Published: 12 May 2026, Last Modified: 12 May 2026 | Accepted by TMLR | CC BY 4.0
Abstract: While sample-based Minimum Bayes Risk (MBR) decoding has been shown to outperform beam search in many text-to-text generation tasks with modern LLMs, beam search remains the dominant approach for Automatic Speech Recognition (ASR) and Speech Translation (ST). To date, the efficacy of MBR decoding within modern speech systems has not been comprehensively evaluated. Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models, as well as supplementary autoregressive baselines. We observe that MBR decoding is more accurate than beam search in most of the experimental settings we evaluated. The results show that MBR decoding is a promising method for ASR and ST tasks that require high accuracy.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have revised the manuscript according to the feedback from the reviewers as follows.

CHANGES IN RESPONSE TO REVIEWER RkQs
--------------------------------------
1. [Critical] Fixed incomplete sentence (pages 3-4): The sentence "This result is consistent with empirical findings showing that larger sample sizes lead to (Freitag et al., 2023)" has been completed to read "...lead to higher generation quality (Freitag et al., 2023)."
2. [Critical] Identical beam search results: We verified the experimental logs and confirmed that the results are correct. We added a note explaining that this behavior stems from the highly peaked probability distributions of the Whisper model family, so the greedy path (B=1) already coincides with the beam-optimal path even for wider beams.
3. [Would strengthen] MBR bias qualification: We added a clarifying sentence in Section 2.2 stating that the "center" selected by MBR reflects the model's own distribution, and therefore MBR will not correct for model-intrinsic biases (e.g., a tendency toward shorter sentences).
4. [Would strengthen] Beam search complexity O(GB): We added a note clarifying that G represents the computational cost of a full decoding step per hypothesis, with vocabulary size and pruning operations absorbed as constant factors, so complexity scales linearly with beam width B.
5. [Would strengthen] Description of Equation 3: The phrase "measures the quality of hypothesis y against reference y'" has been revised to "measures the quality of hypothesis y by treating y' as a pseudo-reference drawn from the set of peer hypotheses," making clear that y' is not a gold-standard reference.

---

CHANGES IN RESPONSE TO REVIEWER BC3i
--------------------------------------
6. [Required] Standard test set for LibriSpeech: We now evaluate on the full LibriSpeech Clean test set (all 2,620 samples). For samples exceeding 30 seconds, we apply the sequential long-form stitching algorithm following the Whisper model's official recommendation. Only 9 samples (~0.3%) exceeded 30 seconds, so the impact of the original exclusion was negligible. MBR decoding continues to outperform beam search on the full set. For all other datasets, we retain the exclusion of >30-second samples and clarify that this is to isolate MBR's effect from long-form heuristics.
7. [Required] Clarified data selection process: We added an explicit explanation that samples longer than 30 seconds are excluded (not truncated) to avoid confounding effects of long-form stitching algorithms.
8. [Recommended] Consensus decoding discussion: We revised the comparison section to explicitly frame MBR as a generalized form of consensus decoding. In particular, we note that MBR with token-level edit metrics (WER/Levenshtein) reduces to a form of ROVER-style majority voting that directly optimizes for word error rate rather than MAP probability.
9. [Recommended] Pruning / decoding configuration: We added a dedicated "Decoding configuration" paragraph specifying that we use the standard HuggingFace transformers beam search implementation with no threshold pruning or diversity penalties beyond the beam width parameter.
10. [Recommended] Short utterance analysis: We added a qualitative analysis explaining why MBR underperforms on short utterances in AMI-IHM. Short utterances in this corpus are predominantly backchannels and non-lexical fillers (e.g., "yeah", "hmm"). BLEU yields zero overlap if a single token differs, creating a flat utility landscape where MBR cannot reliably distinguish hypotheses.

---

CHANGES IN RESPONSE TO REVIEWER qJEk
--------------------------------------
11. [Requested] Alternative model architectures: We added experiments on two non-Whisper sequence-to-sequence models: facebook/s2t-small-librispeech-asr (S2T) and facebook/seamless-m4t-v2-large (SeamlessM4T). MBR decoding achieves competitive or better performance than beam search for both models, confirming generalization across autoregressive architectures. We also attempted to apply MBR to Wav2Vec 2.0 (CTC-based) but found that its extremely peaked per-frame distributions make random sampling equivalent to greedy decoding; this is now noted in both the main text and the Limitations section.
12. [Requested] Epsilon sampling robustness: We added an explanation attributing the robustness of MBR to epsilon values in ASR to the high-confidence nature of current ASR models, which produce sharply peaked output distributions. We note that this observation may not generalize universally across all ASR systems or tasks.
13. [Broader Impact] Environmental footprint: We added a Broader Impact Statement acknowledging the higher energy cost of MBR's O(UN^2 + GN) complexity. We also describe the doubling trick as a practical strategy to mitigate the additional computational cost.

---
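For readers unfamiliar with the selection step referred to in items 5 and 13, the following is a minimal sketch of sample-based MBR selection, not the implementation in the linked repository. It assumes the N sampled transcripts are already available as strings, and it uses negative word error rate as the utility purely for illustration. The function names (`word_edit_distance`, `negative_wer_utility`, `mbr_select`) are hypothetical. Each hypothesis y is scored against every sample y' treated as a pseudo-reference, which is where the N^2 utility calls in O(UN^2 + GN) come from; the GN sampling cost is not shown.

```python
from typing import Callable, List, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over word tokens (single-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def negative_wer_utility(hyp: str, pseudo_ref: str) -> float:
    """Illustrative utility (an assumption): negative WER of hyp against a pseudo-reference."""
    h, r = hyp.split(), pseudo_ref.split()
    return -word_edit_distance(h, r) / max(len(r), 1)


def mbr_select(samples: List[str],
               utility: Callable[[str, str], float] = negative_wer_utility) -> str:
    """Return the sample with the highest average utility against all samples.

    Every sample serves both as a candidate and as a pseudo-reference,
    so this makes N * N utility calls for N samples.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    best, best_score = samples[0], float("-inf")
    for y in samples:
        score = sum(utility(y, y_ref) for y_ref in samples) / len(samples)
        if score > best_score:  # ties keep the earlier sample
            best, best_score = y, score
    return best


# Hypothetical sampled transcripts for one utterance.
samples = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on the mat",
    "the bat sat on the mat",
]
selected = mbr_select(samples)
```

In this toy example the selected transcript is the one closest on average to its peers, which also illustrates item 3: the choice reflects the sample distribution itself, so a bias shared by most samples would not be corrected.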
Code: https://github.com/CyberAgentAILab/mbr-for-asr
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 6969