Keywords: speech recognition, denoising language model, data augmentation, decoding strategies, reproducibility
TL;DR: First independent open-source reproduction and large-scale empirical study of denoising language models for speech recognition, establishing a strong, reproducible baseline and providing insights into key design choices.
Abstract: Denoising language models (DLMs) have been proposed
as a powerful alternative to traditional autoregressive language models (LMs)
for automatic speech recognition (ASR),
motivated by their ability to use bidirectional context
and adapt to a specific ASR model's error patterns.
However, the complexity of the DLM training pipeline has hindered wider investigation.
This paper presents the first independent, large-scale empirical study of the DLM paradigm.
We build and release a complete, reproducible pipeline to systematically investigate the impact of key design choices.
We evaluate dozens of configurations across multiple axes, including various data augmentation techniques
(e.g., SpecAugment, dropout, mixup),
different text-to-speech systems,
and multiple decoding strategies.
Our comparative analysis in a common subword vocabulary setting
demonstrates that our best DLM outperforms our best traditional LM.
However, we observe smaller improvements than those reported in prior character-based work,
suggesting that DLM performance is highly sensitive to factors such as the choice of vocabulary.
Our analysis reveals that a key factor for improving performance
is to condition the DLM on richer information from the ASR's hypothesis space,
rather than just a single best guess.
To this end, we introduce DLM-sum, a novel method for decoding from multiple ASR hypotheses,
which consistently outperforms the previously proposed DSR decoding method.
We believe our findings and public pipeline provide a crucial foundation for the community
to better understand, improve, and build upon this promising class of models.
The code is publicly available at https://anonymous.4open.science/r/2025-dlm/.
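To make the core idea behind DLM-sum concrete, the sketch below shows one plausible way to decode from multiple ASR hypotheses rather than a single best guess: score each candidate correction under the denoising model conditioned on every N-best hypothesis, and marginalize with the ASR posterior as weights. This is a hypothetical illustration, not the paper's actual DLM-sum algorithm; the names `dlm_log_prob`, the softmax weighting, and the candidate pool are all assumptions.

```python
# Hypothetical sketch: marginalizing a denoising model's scores over an ASR
# N-best list. Illustrates the general idea only; not the paper's DLM-sum.
import math
from typing import Callable

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def decode_over_nbest(
    hypotheses: list[str],    # N-best ASR hypotheses
    asr_scores: list[float],  # their ASR log-scores
    candidates: list[str],    # candidate corrected transcripts
    dlm_log_prob: Callable[[str, str], float],  # log p(candidate | hypothesis); assumed interface
) -> str:
    """Pick the candidate whose denoising-model probability, marginalized
    over the N-best list (weighted by the ASR posterior), is highest."""
    weights = softmax(asr_scores)
    best, best_score = None, -math.inf
    for cand in candidates:
        # log sum_i w_i * p(cand | hyp_i), computed in log space for stability
        terms = [math.log(w) + dlm_log_prob(cand, hyp)
                 for w, hyp in zip(weights, hypotheses)]
        m = max(terms)
        score = m + math.log(sum(math.exp(t - m) for t in terms))
        if score > best_score:
            best, best_score = cand, score
    return best
```

In this reading, a candidate that is plausible under several hypotheses accumulates probability mass that a 1-best-only decoder would miss, which matches the abstract's claim that richer hypothesis-space information is the key lever.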
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18160