Keywords: On-device AI, On-device LLMs, 3D integration, 3D-stacked memory, augmented reality
TL;DR: We evaluate the memory power and area savings of 3D-stacked memory (3D-DRAM, 3D-SRAM) over conventional 2D memory (LPDDR-DRAM, SRAM) for on-device LLMs (Distilled GPT-2, GPT-2, BART Base, BART Large).
Abstract: In this paper, we address the growing need for new memory technologies to enable the deployment of on-device large language models (LLMs) on resource-constrained augmented reality (AR) edge devices. We evaluate the memory power and area savings of 3D-stacked memory (3D-DRAM, 3D-SRAM) over conventional 2D memory (LPDDR-DRAM, SRAM). At a target inference rate of 5-100 inferences per second, 3D-DRAM consumes the least memory power of all the memory options, achieving a ∼7-15x improvement in memory power consumption over conventional 2D memory across our benchmark suite of on-device LLMs (Distilled GPT-2, GPT-2, BART Base, and BART Large). While 3D-SRAM can reduce dynamic memory power, the leakage power of storing such a large model becomes prohibitive, which is why 3D-DRAM is the better option for on-device LLMs. Moreover, because 3D-DRAM reduces the memory power consumption of on-device LLMs to tens of milliwatts, it enables the deployment of much larger LLMs that previously could not be deployed with conventional DRAM and 2D-SRAM solutions.
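To make the dynamic-versus-leakage tradeoff in the abstract concrete, the sketch below models total memory power as leakage power plus per-inference dynamic energy scaled by the inference rate. This is a minimal illustration, not the paper's evaluation methodology; every technology parameter is a hypothetical placeholder chosen only to show the qualitative trend (leakage dominates for large SRAM capacities, while 3D-DRAM keeps both terms small), not a measured value from the paper.

```python
# Minimal sketch of the power accounting described in the abstract:
#   total memory power = leakage power + (dynamic energy per inference) * (inference rate)
# All numbers below are illustrative placeholders, NOT measurements from the paper.

LEAKAGE_W = {             # hypothetical standby/leakage power (W)
    "LPDDR-DRAM": 0.050,
    "2D-SRAM":    0.400,  # SRAM leakage grows with capacity; dominant for large models
    "3D-DRAM":    0.005,
    "3D-SRAM":    0.300,
}
DYN_J_PER_INF = {         # hypothetical dynamic energy per inference (J)
    "LPDDR-DRAM": 0.0030,
    "2D-SRAM":    0.0002,
    "3D-DRAM":    0.0003,
    "3D-SRAM":    0.0001,
}

def memory_power_w(tech: str, inferences_per_s: float) -> float:
    """Total memory power (W) at a given inference rate."""
    return LEAKAGE_W[tech] + DYN_J_PER_INF[tech] * inferences_per_s

# Sweep the abstract's target range of 5-100 inferences per second.
for rate in (5, 100):
    for tech in LEAKAGE_W:
        print(f"{tech:>10s} @ {rate:>3d} inf/s: "
              f"{1e3 * memory_power_w(tech, rate):7.1f} mW")
```

With these placeholder parameters, 3D-DRAM comes out lowest at both ends of the rate range because it pays neither the large leakage term of SRAM-based options nor the large per-access dynamic energy of off-chip LPDDR-DRAM, mirroring the qualitative conclusion of the abstract.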
Submission Number: 7