MARCA-v2: Mamba Accelerator with Complementary State Space Model Sparsity and Reconfigurable Architecture

Jinhao Li, Shan Huang, Jiaming Xu, Jun Liu, Ningyi Xu, Guohao Dai

Published: 01 Jan 2025, Last Modified: 25 Jan 2026 | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | Readers: Everyone | License: CC BY-SA 4.0
Abstract: Large language models built on state space models (SSMs), especially Mamba, have demonstrated remarkable capabilities in various domains. Compared to Transformers, Mamba reduces the quadratic computational complexity and achieves higher algorithmic performance. Current research on Mamba focuses primarily on integrating it into various application scenarios, while research on optimizing Mamba processing remains limited. We therefore profile the processing carefully and identify three main challenges in Mamba computations: (1) Large memory access overhead of element-wise operations in the SSM. On the MARCA architecture, the SSM is still the bottleneck when the sequence length reaches 2048, accounting for 62.52% of the total runtime. Within the SSM, the memory access overhead of element-wise operations accounts for 97.17%, consuming 96.56% of the time. (2) Inefficient sparse element-wise execution on the MARCA architecture. SOTA architectures like MARCA propose a reconfigurable reduction tree to accelerate dense element-wise operations but lack effective support for sparse execution. When 30% of the element-wise operations are skipped, the skipped operations are still mapped to the PE array and trigger redundant execution cycles, incurring a 1.43× performance gap relative to the ideal. (3) Large area overhead of nonlinear function units. The exponential function and SiLU are the two main nonlinear functions in the SSM. Previous methods design dedicated units to accelerate them, leading to 38% and 18% area overheads in the PE. In response to these challenges, we propose MARCA-v2, a new Mamba accelerator with complementary SSM sparsity and a reconfigurable architecture, built on MARCA, to support fast and energy-efficient Mamba computations. It introduces three novel techniques: (1) Complementary SSM sparsity with column-wise granularity.
We first profile the numerical distributions of activations in the SSM and propose a column-wise complementary static sparsity scheme for SSM computation. To further enable lightweight and hardware-friendly sparse computation, we propose a δ-bitmap encoding scheme for compressed storage and introduce two abstractions for sparse element-wise operations. (2) Lightweight sparse element-wise architecture. Based on the hardware-friendly sparsity algorithm and the MARCA architecture, we design a lightweight Metadata Processing Unit (Meta-PU) and integrate it into the existing pipeline; it decodes the sparsity metadata and dynamically generates control signals to guide the PE arrays. The overall architecture efficiently supports both dense and sparse operations, maximizing speed and energy efficiency. (3) Reusable nonlinear function unit based on reconfigurable PE arrays. We decompose the exponential function and SiLU into several element-wise operations, so the reconfigurable PEs are fully reused to execute nonlinear functions with negligible accuracy loss. We conduct extensive experiments on Mamba model families of different sizes. Experimental results show that in the prefill stage, MARCA-v2 achieves 1.77-10.87×/1.03-1.08×/4.78-9.10× speedup and 8.29-33.47×/1.03-1.08×/4.78-9.10× energy efficiency improvement compared with Mamba-GPU, MARCA, and Spada, respectively. In the decode stage, MARCA-v2 achieves 0.88-7.65×/1.00-1.01×/1.19-1.64× speedup and 3.11-27.01×/1.00-1.01×/1.19-1.64× energy efficiency improvement compared with Mamba-GPU, MARCA, and Spada, respectively.
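
The 1.43× gap quoted in challenge (2) is consistent with simple arithmetic: if 30% of the element-wise operations could be skipped but still occupy PE cycles, the ideal speedup is 1 / (1 - 0.3) ≈ 1.43. The Python sketch below illustrates this with a plain boolean column bitmap. It is an assumption-laden illustration, not the paper's δ-bitmap encoding or Meta-PU logic (the abstract does not give those details), and every function name in it is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def compress_columns(x: Matrix, keep: List[bool]) -> Matrix:
    """Store only the columns whose bitmap bit is set (hypothetical layout)."""
    return [[v for v, k in zip(row, keep) if k] for row in x]


def sparse_elementwise_mul(
    a_packed: Matrix, b: Matrix, keep: List[bool]
) -> Tuple[Matrix, int]:
    """Multiply the packed columns of `a` with the matching columns of `b`.

    Returns a dense-shaped result (skipped columns stay zero) and the
    number of multiplications actually issued.
    """
    kept_idx = [j for j, k in enumerate(keep) if k]
    out = [[0.0] * len(keep) for _ in b]
    issued = 0
    for i, row in enumerate(a_packed):
        for packed_j, j in enumerate(kept_idx):
            out[i][j] = row[packed_j] * b[i][j]
            issued += 1
    return out, issued


def ideal_speedup(skip_fraction: float) -> float:
    """Speedup if skipped operations cost no cycles at all."""
    return 1.0 / (1.0 - skip_fraction)


# Toy example: 4 rows, 10 columns, 3 of the 10 columns skipped (30%).
keep = [True, False, True, True, False, True, True, False, True, True]
a = [[float(i * 10 + j) for j in range(10)] for i in range(4)]
b = [[2.0] * 10 for _ in range(4)]

packed = compress_columns(a, keep)
out, issued = sparse_elementwise_mul(packed, b, keep)
dense_ops = len(a) * len(keep)  # work done if skipped columns are still mapped

print(issued, dense_ops)                # 28 40
print(round(dense_ops / issued, 2))     # 1.43
print(round(ideal_speedup(0.3), 2))     # 1.43
```

In this toy model, a design that still maps the skipped columns to the PE array performs all 40 multiplications, while a design that consumes the bitmap issues only 28, which reproduces the 40/28 ≈ 1.43 ratio.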