CIM-Aware Quantization for Energy-Efficient Generative AI Models

Published: 03 Nov 2023, Last Modified: 03 Nov 2023, SAGE 2023
Keywords: Generative AI, large language models, quantization, sparsity, compute-in-memory, energy-efficiency
TL;DR: We propose a CIM-aware post-training quantization method for highly energy-efficient large language models (LLMs).
Abstract: Large language models (LLMs) with generative capabilities have shown outstanding performance across a diverse range of applications. However, deploying these models for inference on resource-constrained edge and mobile devices is very challenging due to their massive computational cost. In this paper, we propose a compute-in-memory (CIM) aware post-training quantization and sparsity technique that achieves superior energy efficiency compared to traditional CMOS architectures. In CIM, multiplying by a "0" bit requires less energy than a "1" bit. Therefore, we quantize the LLM weights to values whose binary representations contain fewer ones than zeros. We used a 2-bit/cell resistive CIM (RCIM) test chip to evaluate our method. Since in the 2-bit/cell RCIM E00 < E01 < E10 < E11, our method achieves 2.6x lower energy than the baseline, ultimately enabling LLMs to run on edge devices.
Submission Number: 2
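To make the abstract's idea concrete, below is a minimal Python sketch of a CIM-aware post-training quantizer. It is not the authors' implementation: the function names (cim_aware_quantize, code_energy), the 4-bit weight format, the placeholder per-cell energies in CELL_ENERGY, and the trade-off weight lam are all illustrative assumptions; only the energy ordering E00 < E01 < E10 < E11 and the goal of favoring codes with fewer ones come from the abstract.

```python
# Minimal sketch of CIM-aware post-training quantization (illustrative only).
# Assumptions: 4-bit unsigned weight codes mapped onto two 2-bit RCIM cells,
# and placeholder per-cell energies that respect E00 < E01 < E10 < E11.
import numpy as np

# Hypothetical relative energies for a 2-bit cell value (00, 01, 10, 11).
CELL_ENERGY = {0b00: 0.2, 0b01: 0.5, 0b10: 0.8, 0b11: 1.0}

def code_energy(code: int) -> float:
    """Energy proxy of a 4-bit code stored in two 2-bit cells."""
    return CELL_ENERGY[(code >> 2) & 0b11] + CELL_ENERGY[code & 0b11]

def cim_aware_quantize(w: np.ndarray, n_levels: int = 16, lam: float = 0.05):
    """Per weight, pick the code minimizing |quantization error| + lam * cell energy.

    lam trades accuracy against CIM energy; lam = 0 reduces to plain
    round-to-nearest uniform post-training quantization.
    """
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (n_levels - 1)
    codes = np.arange(n_levels)
    levels = w_min + scale * codes                          # dequantized values
    energies = np.array([code_energy(int(c)) for c in codes])
    # cost[i, c] = |w_i - level_c| + lam * energy(code_c)
    cost = np.abs(w.reshape(-1, 1) - levels[None, :]) + lam * energies[None, :]
    best = cost.argmin(axis=1)                              # chosen code per weight
    return levels[best].reshape(w.shape), best.reshape(w.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(128, 128)).astype(np.float32)

    # Baseline: ordinary round-to-nearest (lam = 0) vs. CIM-aware (lam > 0).
    _, base_codes = cim_aware_quantize(w, lam=0.0)
    wq, cim_codes = cim_aware_quantize(w, lam=0.05)

    avg_e = lambda c: np.mean([code_energy(int(x)) for x in c.ravel()])
    print("mean |error| :", float(np.abs(w - wq).mean()))
    print("energy ratio :", avg_e(cim_codes) / avg_e(base_codes))
```

The energy ratio printed at the end is only a rough proxy under the assumed per-cell costs; the 2.6x figure in the abstract comes from measurements on the authors' RCIM test chip, not from this sketch.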