Keywords: Generative AI, large language models, quantization, sparsity, compute-in-memory, energy-efficiency
TL;DR: We propose a CIM-aware post-training quantization method for highly energy-efficient large language models (LLMs).
Abstract: Large language models (LLMs) with generative capabilities have showcased outstanding performance across a diverse range of
applications. However, deploying these models for inference on resource-constrained edge and mobile devices is very challenging
due to their massive computational demands. In this paper, we propose a compute-in-memory (CIM) aware
post-training quantization and sparsity technique that achieves superior energy efficiency compared to traditional CMOS
architectures. In CIM, multiplying with a "0" bit requires less energy than with a "1". Therefore, we quantize the LLM weights to
values whose binary representations contain fewer ones than zeros. We used a 2-bit/cell resistive CIM (RCIM)
test chip to evaluate our method. Since in the 2-bit/cell RCIM E00 < E01 < E10 < E11, we demonstrated that our method
achieves 2.6x lower energy than the baseline method, ultimately enabling LLMs to run on edge devices.
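The core idea can be illustrated with a small sketch (not the authors' released code): when rounding a weight to an integer code, prefer nearby codes whose 2-bit cell patterns are cheaper under the ordering E00 < E01 < E10 < E11. The energy values, the penalty weight `lam`, the search window, and the 8-bit unsigned format below are all illustrative assumptions.

```python
# A minimal sketch of CIM-aware rounding, assuming hypothetical per-cell
# energies with E00 < E01 < E10 < E11 and 8-bit codes split into four 2-bit cells.
import numpy as np

# Hypothetical relative energies for the 2-bit cell states (00, 01, 10, 11).
CELL_ENERGY = np.array([0.0, 1.0, 2.0, 3.0])

def code_energy(code: int, n_cells: int = 4) -> float:
    """Total cell energy of an unsigned integer code stored across 2-bit cells."""
    e = 0.0
    for _ in range(n_cells):
        e += CELL_ENERGY[code & 0b11]  # energy of the lowest 2-bit cell
        code >>= 2                     # move to the next 2-bit cell
    return e

def cim_aware_round(w: float, scale: float, lam: float = 0.1, search: int = 2) -> int:
    """Round w/scale to a nearby integer code, trading quantization error
    against CIM cell energy; lam weights the energy term."""
    q0 = int(round(w / scale))
    candidates = range(max(0, q0 - search), min(255, q0 + search) + 1)
    return min(candidates,
               key=lambda q: (w - q * scale) ** 2 + lam * code_energy(q))

# Example: a weight near 0.30 may be nudged to a neighboring, cheaper code.
print(cim_aware_round(0.30, 1.0 / 255.0))
```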
Submission Number: 2