XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units

Published: 05 Mar 2025, Last Modified: 14 Apr 2025 · SCOPE - ICLR 2025 Poster · CC BY 4.0
Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: State-Space Models, Neural Processing Units, Efficient Inference, Latency Optimization, Mamba
TL;DR: XAMBA optimizes SSMs on NPUs by transforming inefficient sequential operations into fast, memory-efficient computations.
Abstract: State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks, offering linear or near-linear scalability with sequence length, unlike transformers with quadratic-complexity attention. This makes SSMs ideal for long-sequence tasks in natural language processing (NLP), vision, and edge AI applications such as real-time transcription, translation, and contextual search. These applications demand lightweight, high-performance models for deployment on resource-constrained devices like laptops and PCs. While specialized accelerators have been proposed for emerging neural networks, designing new hardware is time-intensive, costly, and impractical for every model. Instead, optimizing models for existing neural processing units (NPUs) in AI PCs offers a scalable and efficient solution. Towards this end, we propose XAMBA, the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs. Our approach follows a systematic three-step methodology: (1) enabling SSMs on NPUs, (2) optimizing performance to meet target Key Performance Indicator (KPI) requirements, and (3) trading accuracy for additional performance gains. After enabling SSMs on NPUs, XAMBA addresses key performance bottlenecks with two techniques: CumBA and ReduBA. These replace sequential CumSum and ReduceSum operations with matrix-based computations, significantly improving execution speed and memory efficiency. In addition, ActiBA further enhances performance by mapping computationally expensive activation functions (e.g., Swish, Softplus) to NPU hardware using piecewise linear approximations, reducing latency with minimal accuracy loss. Experimental evaluations on an Intel® Core™ Ultra Series 2 AI PC demonstrate that XAMBA achieves significant performance improvements, reducing execution latency by up to 4.8× compared to the baseline implementation.
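To illustrate the CumBA idea described above, the following sketch contrasts a sequential cumulative sum with an equivalent matrix-based formulation. This is a minimal NumPy illustration of the general trick (cumulative sum as a lower-triangular matmul), not the paper's NPU implementation; the function names are our own.

```python
import numpy as np

def cumsum_sequential(x):
    # Baseline: each step depends on the previous one, creating a
    # long sequential dependency chain that NPUs execute poorly.
    out = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc += v
        out[i] = acc
    return out

def cumsum_matmul(x):
    # CumBA-style reformulation: a cumulative sum over n elements
    # equals multiplication by an n-by-n lower-triangular matrix of
    # ones, which maps onto the NPU's matrix-multiply engine.
    n = x.shape[0]
    L = np.tril(np.ones((n, n), dtype=x.dtype))
    return L @ x
```

The same substitution applies to ReduceSum, which is the last row of the same triangular product, so both bottlenecks reduce to dense matrix multiplications the hardware already accelerates.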
Our implementation is available at https://github.com/arghadippurdue/XAMBA.
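The ActiBA technique maps expensive activations to piecewise linear approximations. A minimal sketch of the idea, assuming a uniform knot table for Softplus (the knot count and range here are illustrative choices, not the paper's):

```python
import numpy as np

def softplus(x):
    # Exact reference: softplus(x) = log(1 + exp(x))
    return np.log1p(np.exp(x))

# Hypothetical lookup table: 33 uniformly spaced knots on [-8, 8]
# with exact Softplus values precomputed at each knot.
knots = np.linspace(-8.0, 8.0, 33)
vals = softplus(knots)

def softplus_pwl(x):
    # Piecewise-linear approximation: linear interpolation between
    # knots; outside the table range, np.interp clamps to the
    # endpoint values.
    return np.interp(x, knots, vals)
```

Because Softplus has bounded curvature, the interpolation error on the table range stays well below 0.01 with this knot spacing, consistent with the "minimal accuracy loss" trade-off the abstract describes.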
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 29