3D Analog In-Memory Computing for Efficient Mixture of Experts Large Language Model Inference

Published: 23 Jun 2025, Last Modified: 23 Jun 2025
Venue: Greeks in AI 2025 Poster
License: CC BY 4.0
Keywords: Large language models, AI hardware, In-memory computing
TL;DR: This paper shows that combining mixture of experts with 3D analog in-memory computing can lower inference costs of large language models, improving scalability, energy efficiency, and deployment feasibility.
Abstract: Large Language Models (LLMs), with their remarkable generative capabilities, have greatly impacted various fields but face scalability challenges due to their large parameter counts, which result in high training and inference costs. The trend toward ever-larger models exacerbates these challenges, particularly in terms of memory footprint, latency, and energy consumption. In this paper, we explore the deployment of Mixture of Experts (MoEs) – networks that use conditional computation to keep compute demands low despite having many parameters – on 3D Non-Volatile Memory (NVM)-based Analog In-Memory Computing (AIMC) hardware. Combined with the MoE architecture, this hardware, which uses stacked NVM devices arranged in crossbar arrays, alleviates the parameter-fetching bottleneck typical of traditional models deployed on conventional von Neumann architectures. By simulating the deployment of MoEs on an abstract 3D AIMC system, we demonstrate that, owing to their conditional compute mechanism, MoEs are inherently better suited to this hardware than conventional dense model architectures. Our findings suggest that MoEs, in conjunction with emerging 3D NVM-based AIMC, can substantially reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
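To make the conditional-compute argument concrete, below is a minimal, illustrative sketch (not taken from the paper) of a top-k MoE feed-forward layer. It shows the property the abstract relies on: per token, only k of the E expert weight matrices participate in the computation, so only a fraction of the parameters must be fetched, or, on AIMC hardware, only a fraction of the crossbar tiles must be activated. All sizes, variable names, and the routing scheme here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256      # hidden and expert feed-forward sizes (assumed)
num_experts, top_k = 8, 2    # E experts, k active per token (assumed)

# Expert weights: in an AIMC deployment each (W_in, W_out) pair would map to
# stacked NVM crossbar tiles; here they are plain NumPy arrays.
W_in = rng.standard_normal((num_experts, d_model, d_ff)) * 0.02
W_out = rng.standard_normal((num_experts, d_ff, d_model)) * 0.02
W_router = rng.standard_normal((d_model, num_experts)) * 0.02


def moe_forward(x):
    """Route a single token through its top-k experts only."""
    logits = x @ W_router
    selected = np.argsort(logits)[-top_k:]            # indices of selected experts
    gates = np.exp(logits[selected] - logits[selected].max())
    gates /= gates.sum()                              # softmax over selected experts
    y = np.zeros_like(x)
    for g, e in zip(gates, selected):
        h = np.maximum(x @ W_in[e], 0.0)              # expert FFN with ReLU
        y += g * (h @ W_out[e])
    return y, selected


x = rng.standard_normal(d_model)
y, used = moe_forward(x)

total_params = num_experts * (d_model * d_ff + d_ff * d_model)
active_params = top_k * (d_model * d_ff + d_ff * d_model)
print(f"experts used for this token: {sorted(used.tolist())}")
print(f"active expert parameters: {active_params}/{total_params} "
      f"({100 * active_params / total_params:.0f}%)")
```

With these assumed settings, only 2 of 8 experts (25% of the expert parameters) are touched per token; on a von Neumann machine the inactive experts' weights would still occupy memory bandwidth budget when swapped, whereas on 3D AIMC they simply reside in unselected crossbar tiles that stay idle.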
Submission Number: 157