ConMamba: A Convolution-Augmented Mamba Encoder Model for Efficient End-to-End ASR Systems

Haoxiang Hou, Xun Gong, Yanmin Qian

Published: 2024, Last Modified: 14 May 2025. Venue: ISCSLP 2024. License: CC BY-SA 4.0

Abstract: End-to-End Automatic Speech Recognition (ASR) models, such as Conformer, excel in accuracy but face limitations in computational complexity and positional awareness, hindering their use in real-time or resource-constrained settings. State Space Models (SSMs), particularly the Mamba model with its time-varying mechanism, offer a more efficient alternative. We propose ConMamba, a Convolution-Augmented Mamba Encoder model, which replaces the Conformer's Multi-Head Self-Attention with Mamba layers and adds convolutional layers to capture both local and global features. Experiments on the LibriSpeech dataset show that ConMamba matches the traditional Conformer on short speech segments and outperforms it on longer ones, enhancing robustness and efficiency for practical ASR applications.
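To make the "time-varying mechanism" concrete, the following is a minimal sketch of a selective state-space scan of the kind Mamba is built on. It is not the authors' implementation. It assumes a single channel, a diagonal state matrix, and a simplified discretization (exponential for the state matrix, Euler for the input matrix). The names `selective_scan`, `w_delta`, `w_b`, and `w_c` are hypothetical, chosen only for illustration. The point it shows is that the step size and the input/output projections depend on the current input, while the cost stays linear in sequence length, unlike the quadratic cost of self-attention.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size strictly positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(x, a, w_delta, w_b, w_c):
    """Toy single-channel selective SSM scan (illustrative assumption only).

    x       : list of floats, the input sequence
    a       : list of negative floats, diagonal of the state matrix (length N)
    w_delta : float, hypothetical weight mapping the input to a step size
    w_b, w_c: lists of length N, hypothetical weights mapping the input
              to the time-varying input and output projections
    """
    n = len(a)
    h = [0.0] * n  # hidden state, carried across time steps
    y = []
    for x_t in x:  # one pass over the sequence: O(T * N)
        delta = softplus(w_delta * x_t)  # input-dependent step size
        for i in range(n):
            a_bar = math.exp(delta * a[i])   # discretized state transition
            b_bar = delta * (w_b[i] * x_t)   # discretized, input-dependent B_t
            h[i] = a_bar * h[i] + b_bar * x_t
        # input-dependent output projection C_t applied to the state
        y.append(sum(w_c[i] * x_t * h[i] for i in range(n)))
    return y


# Example parameters (arbitrary values, for illustration only).
A_DIAG = [-1.0, -0.5]
W_B = [0.3, -0.2]
W_C = [1.0, 0.5]
y_demo = selective_scan([1.0, 0.5, -0.3], A_DIAG, 0.7, W_B, W_C)
```

Because the state is updated strictly left to right, each output depends only on the current and earlier inputs. In a ConMamba-style encoder, a layer of this kind would take the place of self-attention, with convolutional layers handling local context; the exact block layout is described in the paper, not in this sketch.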