BlockDecoder: Boosting ASR Decoders with Context and Merger Modules

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Automatic speech recognition, Attention Encoder Decoder, Efficient Decoder
TL;DR: We propose BlockDecoder, a novel ASR decoder architecture that separates textual context building from audio-text integration, achieving a ~2x speed-up over traditional decoders without performance degradation across datasets, languages and tasks.
Abstract: Attention-based encoder-decoder models remain a popular choice for state-of-the-art automatic speech recognition (ASR). These models combine a powerful audio encoder that extracts rich acoustic features with a decoder that autoregressively produces the ASR output. The decoder handles two critical tasks: (1) building rich text-only context and (2) merging acoustic information from the encoder to ensure the predictions remain faithful to the audio. We observe a systematic pattern across the attention distributions of decoder layers in prior architectures: the initial layers direct most attention towards building textual context, while the later layers largely focus on merging acoustic and textual information for the final predictions. Leveraging this key insight, we propose **BlockDecoder**, a novel decoder architecture comprising two distinct components: a purely text-based text encoder, and a **Merger** that combines information from the audio encoder and text encoder to generate output tokens. Unlike traditional decoders, the **Merger** autoregressively predicts a sequence of K tokens within a *block* of size K, while relying on the same precomputed contextual information from both text and audio encoders across the block. This design choice allows for the efficient reuse of encoder representations. The separation of the decoder into the text encoder and the **Merger** promotes modularity and more flexible control of parameters via the number of text encoder and **Merger** layers. As a result, **BlockDecoder** yields a significant speedup ($\sim2$x) compared to traditional decoders, across diverse datasets, languages, and speech tasks, without any degradation in performance. The code is available at https://github.com/csalt-research/blockdecoder.
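The core efficiency idea in the abstract is that the Merger reuses the same precomputed text-encoder and audio-encoder states for all K tokens in a block, so the expensive context-building runs once per block rather than once per token. The following is a minimal, purely illustrative sketch of that decoding schedule; all dimensions, the random "encoders", and the function names (`encode_text`, `merger_step`, `block_decode`) are hypothetical stand-ins, not the actual BlockDecoder implementation (see the linked repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper does not specify these here).
VOCAB, D, K = 16, 8, 4   # vocab size, hidden dim, block size K

# Frozen random weights standing in for the audio encoder output,
# the text encoder, and the Merger's output projection.
W_audio = rng.normal(size=(D,))          # pooled audio representation
W_text  = rng.normal(size=(VOCAB, D))    # per-token text-encoder table
W_out   = rng.normal(size=(D, VOCAB))    # Merger output projection

def encode_text(tokens):
    """Pooled text-encoder state; recomputed once per block, not per token."""
    if not tokens:
        return np.zeros(D)
    return W_text[tokens].mean(axis=0)

def merger_step(text_state, audio_state, prev_token):
    """One autoregressive Merger step reusing the precomputed states."""
    h = text_state + audio_state + W_text[prev_token]
    return int(np.argmax(h @ W_out))

def block_decode(n_blocks=2, bos=0):
    tokens = [bos]
    for _ in range(n_blocks):
        # Key schedule: encoder states are computed once here and then
        # reused for all K autoregressive steps inside the block.
        text_state = encode_text(tokens)
        for _ in range(K):
            tokens.append(merger_step(text_state, W_audio, tokens[-1]))
    return tokens

out = block_decode()
```

The contrast with a traditional decoder is that the loop over K tokens touches only the lightweight `merger_step`, while `encode_text` (the part playing the role of the early, context-building decoder layers) runs once per block, which is where the reported ~2x speedup comes from.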
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 19552