Generating Efficient Kernels for Quantized Inference on Large Language Models

Tommaso Pegolotti; Elias Frantar; Dan Alistarh; Markus Püschel

Published: 20 Jun 2023, Last Modified: 16 Jul 2023

Venue: ES-FoMO 2023 Poster

Keywords: Code Generation, Large Language Models, LLM, Quantization, Model Compression, GPTQ, LLaMA

TL;DR: We generate kernels using CPU-specific parameters to improve the inference performance of large language models such as LLaMA.

Abstract: We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.

Submission Number: 57
