Abstract: Denoising language models (DLMs) have been proposed
as a powerful alternative to traditional language models (LMs)
for improving automatic speech recognition (ASR),
motivated by their ability to use bidirectional context
and adapt to error patterns of ASR models.
However, the complexity of the DLM training pipeline has hindered wider investigation.
This paper presents the first independent, large-scale empirical study of DLMs.
Using a reproducible pipeline,
we evaluate dozens of configurations across data augmentation, text-to-speech systems, and decoding strategies.
Our analysis reveals that while traditional LMs are more efficient at lower training compute budgets,
DLMs exhibit superior scaling and surpass LMs after a distinct compute tipping point,
mirroring behaviors observed in diffusion language models.
Our results show that the magnitude of DLM improvement
is sensitive to the baseline ASR performance and vocabulary choice,
and that a key factor for improving performance
is conditioning the DLM on richer information from the ASR hypothesis space
rather than on a single best hypothesis.
To this end, we introduce DLM-sum, a novel method for decoding from multiple ASR hypotheses,
which consistently outperforms the previously proposed DSR decoding method.
The code is publicly available at https://anonymous.4open.science/r/2025-dlm/.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Meng1
Submission Number: 9083