Abstract: While sample-based Minimum Bayes Risk (MBR) decoding has been shown to outperform beam search in many text-to-text generation tasks with modern LLMs, beam search remains the dominant approach for Automatic Speech Recognition (ASR) and Speech Translation (ST). To date, the efficacy of MBR decoding in modern speech systems has not been comprehensively evaluated.
Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks.
In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models, as well as supplementary autoregressive baselines.
We observe that MBR decoding is more accurate than beam search in most of the experimental settings we evaluated.
The results show that MBR decoding is a promising method for ASR and ST tasks that require high accuracy.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have revised the manuscript according to the feedback from the reviewers as follows.
CHANGES IN RESPONSE TO REVIEWER RkQs
--------------------------------------
1. [Critical] Fixed incomplete sentence (pages 3-4): The sentence "This result is consistent
with empirical findings showing that larger sample sizes lead to (Freitag et al., 2023)"
has been completed to read "...lead to higher generation quality (Freitag et al., 2023)."
2. [Critical] Identical beam search results: We verified the experimental logs and confirmed
that the results are correct. We added a note explaining that this behavior stems from the
highly peaked probability distributions of the Whisper model family, so the greedy path
(B=1) already coincides with the beam-optimal path even for wider beams.
3. [Would strengthen] MBR bias qualification: We added a clarifying sentence in Section 2.2
stating that the "center" selected by MBR reflects the model's own distribution, and
therefore MBR will not correct for model-intrinsic biases (e.g., a tendency toward shorter
sentences).
4. [Would strengthen] Beam search complexity O(GB): We added a note clarifying that G
represents the computational cost of a full decoding step per hypothesis, with vocabulary
size and pruning operations absorbed as constant factors, so complexity scales linearly
with beam width B.
5. [Would strengthen] Description of Equation 3: The phrase "measures the quality of
hypothesis y against reference y'" has been revised to "measures the quality of hypothesis
y by treating y' as a pseudo-reference drawn from the set of peer hypotheses," making
clear that y' is not a gold-standard reference.
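To make the pseudo-reference reading of items 3 and 5 concrete, the following is a minimal sketch of sample-based MBR selection, not the implementation used in the paper. The function names (`mbr_select`, `neg_wer_utility`) and the choice of a negative word-error-rate utility are assumptions made for illustration. Each sampled hypothesis y is scored against every peer y' drawn from the same model, so the selected output reflects the model's own distribution, and the double loop makes N^2 utility calls for N samples.

```python
from typing import Callable, List, Tuple


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def neg_wer_utility(hyp: str, pseudo_ref: str) -> float:
    """Illustrative utility (an assumption): negative WER of hyp against a peer."""
    ref_words = pseudo_ref.split()
    dist = word_edit_distance(hyp.split(), ref_words)
    return -dist / max(len(ref_words), 1)


def mbr_select(
    samples: List[str], utility: Callable[[str, str], float]
) -> Tuple[str, int]:
    """Return the sample with the highest mean utility over all peers.

    Every sample serves as a pseudo-reference y' for every hypothesis y
    (including itself), so no gold-standard reference is involved.
    Also returns the number of utility calls made.
    """
    calls = 0
    best, best_score = samples[0], float("-inf")
    for y in samples:
        total = 0.0
        for y_prime in samples:
            total += utility(y, y_prime)
            calls += 1
        score = total / len(samples)
        if score > best_score:
            best, best_score = y, score
    return best, calls


# Hypothetical samples standing in for N = 4 model outputs.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the mat",
    "a cat sits on the mat",
]
best, calls = mbr_select(samples, neg_wer_utility)
print(best, calls)
```

Because the scores are averaged over samples from the model itself, a systematic model bias (for example, toward shorter outputs) shifts the whole sample set and is not corrected by the selection step.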
---
CHANGES IN RESPONSE TO REVIEWER BC3i
--------------------------------------
6. [Required] Standard test set for LibriSpeech: We now evaluate on the full LibriSpeech
Clean test set (all 2,620 samples). For samples exceeding 30 seconds, we apply the
sequential long-form stitching algorithm following the Whisper model's official
recommendation. Only 9 samples (~0.3%) exceeded 30 seconds, so the impact of the original
exclusion was negligible. MBR decoding continues to outperform beam search on the full
set. For all other datasets, we retain the exclusion of >30-second samples and clarify
that this is to isolate MBR's effect from long-form heuristics.
7. [Required] Clarified data selection process: We added an explicit explanation that samples
longer than 30 seconds are excluded (not truncated) to avoid confounding effects of
long-form stitching algorithms.
8. [Recommended] Consensus decoding discussion: We revised the comparison section to
explicitly frame MBR as a generalized form of consensus decoding. In particular, we note
that MBR with token-level edit metrics (WER/Levenshtein) reduces to a form of ROVER-style
majority voting that directly optimizes for word error rate rather than MAP probability.
9. [Recommended] Pruning / decoding configuration: We added a dedicated "Decoding
configuration" paragraph specifying that we use the standard HuggingFace transformers beam
search implementation with no threshold pruning or diversity penalties beyond the beam
width parameter.
10. [Recommended] Short utterance analysis: We added a qualitative analysis explaining why
MBR underperforms on short utterances in AMI-IHM. Short utterances in this corpus are
predominantly backchannels and non-lexical fillers (e.g., "yeah", "hmm"). BLEU yields
zero overlap if a single token differs, creating a flat utility landscape where MBR cannot
reliably distinguish hypotheses.
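The flat utility landscape described in item 10 can be illustrated with a small sketch. It uses a clipped unigram-precision score as a simplified stand-in for BLEU; this stand-in, the function names, and the example utterances are all assumptions for illustration and are not taken from the paper's experiments. For one-token utterances, any two distinct hypotheses share no tokens, so every hypothesis receives the same expected utility, whereas longer utterances produce graded scores.

```python
from collections import Counter
from typing import List


def unigram_precision(hyp: str, pseudo_ref: str) -> float:
    """Clipped unigram precision: a simplified stand-in for BLEU (assumption)."""
    hyp_tokens = hyp.split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(pseudo_ref.split())
    matched = sum(
        min(count, ref_counts[token])
        for token, count in Counter(hyp_tokens).items()
    )
    return matched / len(hyp_tokens)


def expected_utilities(samples: List[str]) -> List[float]:
    """Mean utility of each sample against all samples used as pseudo-references."""
    n = len(samples)
    return [
        sum(unigram_precision(y, y_prime) for y_prime in samples) / n
        for y in samples
    ]


# Hypothetical backchannel-style samples: every pair of distinct items has zero overlap.
short_samples = ["yeah", "yep", "hmm", "mhm"]
# Hypothetical longer samples: partial overlap gives graded scores.
long_samples = [
    "i think we should start",
    "i think we should stop",
    "i think we could start",
    "we think i should go",
]

short_scores = expected_utilities(short_samples)
long_scores = expected_utilities(long_samples)
print(short_scores)
print(long_scores)
```

In the short case all expected utilities tie, so the MBR choice is decided by tie-breaking rather than by the utility function.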
---
CHANGES IN RESPONSE TO REVIEWER qJEk
--------------------------------------
11. [Requested] Alternative model architectures: We added experiments on two non-Whisper
sequence-to-sequence models: facebook/s2t-small-librispeech-asr (S2T) and
facebook/seamless-m4t-v2-large (SeamlessM4T). MBR decoding achieves competitive or
better performance than beam search for both models, confirming generalization across
autoregressive architectures. We also attempted to apply MBR to Wav2Vec 2.0 (CTC-based)
but found that its extremely peaked per-frame distributions make random sampling
equivalent to greedy decoding; this is now noted in both the main text and the
Limitations section.
12. [Requested] Epsilon sampling robustness: We added an explanation attributing the
robustness of MBR to epsilon values in ASR to the high-confidence nature of current ASR
models, which produce sharply peaked output distributions. We note that this observation
may not generalize universally across all ASR systems or tasks.
13. [Broader Impact] Environmental footprint: We added a Broader Impact Statement
acknowledging the higher energy cost of MBR's O(UN^2 + GN) complexity. We also describe
the doubling trick as a practical strategy to mitigate the additional computational cost.
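The explanation in item 12 can be sketched as follows, assuming the usual definition of epsilon sampling: tokens whose probability falls below epsilon are dropped, and the remainder is renormalized before sampling. The function names, the fallback to the most probable token when nothing survives the cutoff, and the toy distribution are assumptions for illustration. With a sharply peaked distribution, a wide range of epsilon values leaves the same truncated distribution, which is one way to read the observed robustness.

```python
import random
from typing import Dict


def epsilon_truncate(probs: Dict[str, float], epsilon: float) -> Dict[str, float]:
    """Drop tokens with probability below epsilon, then renormalize."""
    kept = {tok: p for tok, p in probs.items() if p >= epsilon}
    if not kept:
        # Assumed fallback: keep only the most probable token.
        top = max(probs, key=probs.get)
        kept = {top: probs[top]}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def epsilon_sample(probs: Dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Draw one token from the epsilon-truncated distribution."""
    truncated = epsilon_truncate(probs, epsilon)
    tokens = list(truncated)
    weights = [truncated[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical peaked next-token distribution, as might arise in a confident ASR model.
peaked = {"the": 0.97, "a": 0.029, "uh": 0.0006, "um": 0.0004}

low_eps = epsilon_truncate(peaked, 0.001)
high_eps = epsilon_truncate(peaked, 0.02)
print(low_eps)
print(high_eps)

rng = random.Random(0)
draws = [epsilon_sample(peaked, 0.01, rng) for _ in range(5)]
print(draws)
```

A flatter distribution would lose or gain tokens as epsilon changes, which is consistent with the caveat that this robustness may not generalize to all ASR systems or tasks.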
---
Code: https://github.com/CyberAgentAILab/mbr-for-asr
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 6969