Reproducing and Dissecting Denoising Language Models for Speech Recognition

TMLR Paper9083 Authors

20 May 2026 (modified: 06 Jun 2026). Under review for TMLR. CC BY 4.0

Abstract: Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for improving automatic speech recognition (ASR), motivated by their ability to use bidirectional context and adapt to error patterns of ASR models. However, the complexity of the DLM training pipeline has hindered wider investigation. This paper presents the first independent, large-scale empirical study of DLMs. Using a reproducible pipeline, we evaluate dozens of configurations across data augmentation, text-to-speech systems, and decoding strategies. Our analysis reveals that while traditional LMs are more efficient at lower training compute budgets, DLMs exhibit superior scaling and surpass LMs after a distinct compute tipping point, mirroring behaviors observed in diffusion language models. Our results show that the magnitude of DLM improvement is sensitive to the baseline ASR performance and vocabulary choice. A key factor for improving performance is conditioning the DLM on richer information from the ASR's hypothesis space, rather than on just a single best guess. To this end, we introduce DLM-sum, a novel method for decoding from multiple ASR hypotheses, which consistently outperforms the previously proposed DSR decoding method. The code is publicly available at https://anonymous.4open.science/r/2025-dlm/.

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Yu_Meng1

Submission Number: 9083
