Whisfusion: Parallel ASR Decoding via a Diffusion Transformer

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Automatic Speech Recognition (ASR), Non-Autoregressive Models, Diffusion Transformers, Whisper, Speech-to-Text
TL;DR: Whisfusion is a non-autoregressive ASR model combining a Whisper encoder and a diffusion transformer, delivering significant speed-ups over autoregressive models without sacrificing accuracy.
Abstract: Fast automatic speech recognition (ASR) is crucial for applications such as captioning and transcription. Although modern ASR encoders can process up to ~30 seconds of audio in a single pass, Whisper-style autoregressive (AR) decoders still generate tokens sequentially, so decoding latency grows linearly with utterance length. We propose Whisfusion, a non-autoregressive (NAR) ASR framework that fuses a frozen pre-trained Whisper encoder with a masked-diffusion text decoder. At each diffusion step, the decoder conditions on the full acoustic context and updates all tokens in parallel, mitigating the AR latency bottleneck while preserving Whisper-compatible generative structure. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning bridges audio and text, and we introduce Parallel Diffusion Decoding (PDD), an ASR-tailored batch-parallel sampling scheme that improves the accuracy–latency trade-off in low-to-mid batch regimes. With 6.5k hours of training data, Whisfusion reaches 4.9% WER on LibriSpeech test-clean, comparable to the similarly sized Whisper-small (5.0%), while decoding much faster. In particular, on 20–30 s segments within Whisper's 30 s window, Whisfusion reduces decoding time from 674.7 ms to 80.7 ms (8.4× faster) at similar accuracy, demonstrating an efficient NAR operating point for Whisper-compatible ASR.
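
To make the parallel-decoding idea concrete, below is a minimal sketch of one common realization of masked-diffusion decoding (MaskGIT-style confidence-based remasking): start from a fully masked transcript, predict every position in parallel conditioned on the audio, and re-mask the least-confident positions for refinement in later steps. This is an illustration under assumptions, not the paper's actual PDD sampler; `model`, `MASK_ID`, `SEQ_LEN`, and `NUM_STEPS` are all hypothetical names.

```python
# Minimal sketch of masked-diffusion parallel decoding (MaskGIT-style
# confidence remasking). All names are hypothetical illustrations, not
# the paper's actual PDD implementation.
import torch

MASK_ID = 50257   # hypothetical id of the [MASK] token
SEQ_LEN = 448     # hypothetical maximum transcript length
NUM_STEPS = 8     # number of parallel refinement steps

@torch.no_grad()
def diffusion_decode(model, audio_feats, seq_len=SEQ_LEN, num_steps=NUM_STEPS):
    """Start fully masked; at each step, predict every position in parallel
    (conditioned on the full acoustic context), then re-mask the
    least-confident positions so later steps can refine them."""
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long,
                        device=audio_feats.device)
    for step in range(num_steps):
        logits = model(tokens, audio_feats)       # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)   # parallel per-position picks
        # Commit a growing fraction of tokens each step; re-mask the rest.
        num_to_mask = int(seq_len * (1.0 - (step + 1) / num_steps))
        if num_to_mask > 0:
            worst = conf[0].topk(num_to_mask, largest=False).indices
            pred[0, worst] = MASK_ID
        tokens = pred
    return tokens
```

Note that every step here costs one decoder forward pass regardless of transcript length, which is the source of the NAR speed-up; the paper's PDD additionally runs such sampling batch-parallel, a detail not reproduced in this sketch.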
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24589