MLA-Conformer: A Latent Attention-Enhanced Conformer for Efficient Speech Recognition

Published: 01 Aug 2025, Last Modified: 26 Aug 2025, SpeechAI TTIC 2025 Oral or Poster, CC BY 4.0
Keywords: Conformer, Multi-Head Latent Attention, Automatic Speech Recognition, Efficient Attention
TL;DR: The Multi-Head Latent Attention Conformer cuts attention cost and boosts efficiency in ASR while keeping accuracy close to the baseline, making it well suited for real-time and resource-limited speech recognition.
Presentation Preference: Yes
Abstract: The Conformer architecture has set a high standard in automatic speech recognition (ASR) by effectively combining convolutional neural networks with multi-head self-attention modules, enabling the modeling of both local and global dependencies. However, the quadratic computational and memory complexity of standard multi-head self-attention limits the scalability of Conformer models, especially for long audio sequences and real-time applications. In this work, we propose integrating Multi-Head Latent Attention (MLA), a low-rank attention approximation, into the Conformer encoder to reduce complexity without sacrificing performance. MLA introduces a fixed number of latent vectors that mediate attention computation, reducing the attention cost from $\mathcal{O}(n^2)$ to $\mathcal{O}(nk)$, where $k \ll n$. We describe the architectural modifications for seamless integration and present comprehensive experiments on the LibriSpeech dataset. Our MLA-Conformer achieves word error rates of 2.3\% and 4.7\% on the test-clean and test-other subsets, respectively, compared to the baseline Conformer's 2.1\% and 4.3\%. These results demonstrate that MLA-Conformer provides an effective trade-off between efficiency and accuracy, making it suitable for deployment in resource-constrained and real-time speech recognition scenarios.
Submission Number: 3
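
As a minimal, hypothetical sketch (not the authors' implementation), the block below illustrates the latent-mediated attention pattern described in the abstract: a fixed set of k learned latent vectors first attends over the n input frames, and the frames then attend back over the latents, so every attention map is of size n x k or k x n rather than n x n, giving the stated O(nk) cost. All module and parameter names (LatentAttentionBlock, num_latents, read, write) are assumptions for illustration only.

import torch
import torch.nn as nn

class LatentAttentionBlock(nn.Module):
    """Illustrative latent-mediated attention: k learned latents summarise the
    n input frames, then the frames query the latents, so both attention maps
    cost O(n * k) instead of O(n^2)."""

    def __init__(self, d_model: int, num_latents: int = 32, num_heads: int = 4):
        super().__init__()
        # Fixed number of learned latent vectors shared across the batch.
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        # Latents gather information from the sequence: attention map is (k x n).
        self.read = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # The sequence queries the summarised latents: attention map is (n x k).
        self.write = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model) acoustic frame representations.
        b = x.size(0)
        z = self.latents.unsqueeze(0).expand(b, -1, -1)   # (batch, k, d_model)
        z, _ = self.read(query=z, key=x, value=x)          # latents attend to frames
        out, _ = self.write(query=x, key=z, value=z)       # frames attend to latents
        return x + out                                     # residual, as in Conformer blocks

if __name__ == "__main__":
    block = LatentAttentionBlock(d_model=256, num_latents=32, num_heads=4)
    feats = torch.randn(2, 1000, 256)   # 2 utterances, 1000 frames each
    print(block(feats).shape)           # torch.Size([2, 1000, 256])

In this sketch the latent count k is a hyperparameter independent of sequence length, which is what keeps the cost linear in n; how the paper sizes and integrates the latent path inside the Conformer encoder is described in the full text.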