Keywords: speech recognition, denoising language model, data augmentation, decoding strategies, reproducibility
TL;DR: First independent open-source reproduction and large-scale empirical study of denoising language models for speech recognition, establishing a strong, reproducible baseline and providing insights into key design choices.
Abstract: Denoising language models (DLMs) have been proposed
as a powerful alternative to traditional language models (LMs)
for automatic speech recognition (ASR),
motivated by their ability to use bidirectional context
and adapt to a specific ASR model's error patterns.
However, the complexity of the DLM training pipeline has hindered wider investigation.
This paper presents the *first independent, large-scale empirical study* of DLMs.
We build and release a *complete, reproducible pipeline* to systematically investigate the impact of key design choices.
We evaluate dozens of configurations across multiple axes, including various data augmentation techniques
(e.g., SpecAugment, dropout, mixup),
different text-to-speech systems,
and multiple decoding strategies.
Our comparative analysis in a common subword vocabulary setting
demonstrates that *DLMs outperform traditional LMs*,
but only after a distinct compute tipping point.
While LMs are more efficient at lower budgets, DLMs scale better with longer training,
mirroring behaviors observed in diffusion language models.
However, we observe smaller improvements than those reported in prior character-based work,
indicating that DLM performance depends on factors such as the vocabulary.
Our analysis reveals that a key factor for improving performance
is to condition the DLM on *richer information from the ASR's hypothesis space*,
rather than just a single best guess.
To this end, we introduce *DLM-sum, a novel method for decoding from multiple ASR hypotheses*,
which consistently outperforms the previously proposed DSR decoding method.
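To make the idea concrete, here is a minimal sketch of decoding from multiple ASR hypotheses by marginalizing the denoising model's score over an n-best list. This is an illustration only, not the paper's actual DLM-sum implementation: the function names (`dlm_sum_score`, `dlm_logprob`) and the log-sum-exp combination rule are assumptions for exposition.

```python
import math

def dlm_sum_score(candidate, hypotheses, dlm_logprob):
    """Score a candidate transcript by marginalizing over ASR hypotheses.

    hypotheses: list of (text, log_posterior) pairs from the ASR n-best list.
    dlm_logprob: callable (candidate, hypothesis) -> log p(candidate | hypothesis),
        a hypothetical stand-in for the denoising LM's conditional score.
    """
    # log-sum-exp over hypotheses: log sum_h p(h) * p(candidate | h).
    terms = [lp + dlm_logprob(candidate, h) for h, lp in hypotheses]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

In this sketch, conditioning on the full n-best list (rather than only the 1-best hypothesis) lets correct tokens that appear in lower-ranked hypotheses contribute to the final score, which is the intuition behind using richer information from the ASR's hypothesis space.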
We believe our findings and public pipeline provide a crucial foundation for the community
to better understand, improve, and build upon this promising class of models.
The code is publicly available at https://anonymous.4open.science/r/2025-dlm/.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18160