Keywords: speech recognition, denoising language model, data augmentation, decoding strategies, reproducibility
TL;DR: First independent open-source reproduction and large-scale empirical study of denoising language models for speech recognition, establishing a strong, reproducible baseline and providing insights into key design choices.
Abstract: Denoising language models (DLMs) have been proposed
as a powerful alternative to traditional language models (LMs)
for automatic speech recognition (ASR),
motivated by their ability to use bidirectional context
and adapt to a specific ASR model's error patterns.
However, the complexity of the DLM training pipeline has hindered wider investigation.
This paper presents the *first independent, large-scale empirical study* of DLMs.
We build and release a *complete, reproducible pipeline* to systematically investigate the impact of key design choices.
We evaluate dozens of configurations across multiple axes, including various data augmentation techniques
(e.g., SpecAugment, dropout, mixup),
different text-to-speech systems,
and multiple decoding strategies.
Our comparative analysis in a common subword vocabulary setting
demonstrates that *DLMs outperform traditional LMs*,
but only after a distinct compute tipping point.
While LMs are more efficient at lower budgets, DLMs scale better with longer training,
mirroring behaviors observed in diffusion language models.
However, we observe smaller improvements than those reported in prior character-based work,
indicating that DLM performance depends on factors such as the vocabulary.
Our analysis reveals that a key factor for improving performance
is to condition the DLM on *richer information from the ASR's hypothesis space*,
rather than just a single best guess.
To this end, we introduce *DLM-sum, a novel method for decoding from multiple ASR hypotheses*,
which consistently outperforms the previously proposed DSR decoding method.
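To make the idea concrete, here is a minimal sketch of decoding from multiple ASR hypotheses by marginalizing the denoising model's score over an n-best list. This is an illustration only, not the paper's actual DLM-sum implementation: the function names (`dlm_sum_score`, `dlm_logprob`) and the log-sum-exp combination rule are assumptions for exposition.

```python
import math

def dlm_sum_score(candidate, hypotheses, dlm_logprob):
    """Score a candidate transcript by marginalizing over ASR hypotheses.

    hypotheses: list of (text, log_posterior) pairs from the ASR n-best list.
    dlm_logprob: callable (candidate, hypothesis) -> log p(candidate | hypothesis),
        a hypothetical stand-in for the denoising LM's conditional score.
    """
    # log-sum-exp over hypotheses: log sum_h p(h) * p(candidate | h).
    terms = [lp + dlm_logprob(candidate, h) for h, lp in hypotheses]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

In this sketch, conditioning on the full n-best list (rather than only the 1-best hypothesis) lets correct tokens that appear in lower-ranked hypotheses contribute to the final score, which is the intuition behind using richer information from the ASR's hypothesis space.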
We believe our findings and public pipeline provide a crucial foundation for the community
to better understand, improve, and build upon this promising class of models.
The code is publicly available at https://anonymous.4open.science/r/2025-dlm/.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18160