Abstract: The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. In this paper, we study how to lower the resource requirements of LLM inference to a single commodity GPU while achieving practical performance. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access tensors, including the weights, activations, and attention key/value (KV) cache. FlexGen further compresses both the weights and the KV cache to 4 bits with negligible accuracy loss. Compared with state-of-the-art offloading systems, FlexGen runs OPT-175B up to 100× faster on a single 16GB GPU and, for the first time, achieves a practical generation throughput of 1 token/s. FlexGen also comes with a pipeline parallelism runtime that allows super-linear scaling of decoding throughput when more distributed GPUs are available.
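For intuition on the 4-bit compression mentioned in the abstract, such schemes are typically group-wise quantization: each small group of values is rescaled to the 16 levels representable in 4 bits. The snippet below is a minimal NumPy sketch under that assumption; the group size, function names, and lack of bit-packing are illustrative choices, not FlexGen's actual implementation.

```python
import numpy as np

def quantize_4bit_groupwise(x: np.ndarray, group_size: int = 64):
    """Quantize a float tensor to 4-bit codes with per-group min/max scaling.

    Returns integer codes in [0, 15] plus the per-group scale and minimum
    needed to dequantize. Assumes x.size is divisible by group_size.
    """
    flat = x.astype(np.float32).reshape(-1, group_size)
    mn = flat.min(axis=1, keepdims=True)
    mx = flat.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0                      # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero
    codes = np.clip(np.round((flat - mn) / scale), 0, 15).astype(np.uint8)
    return codes, scale, mn

def dequantize_4bit_groupwise(codes, scale, mn, shape):
    """Recover an approximate float tensor from the 4-bit codes."""
    return (codes.astype(np.float32) * scale + mn).reshape(shape)

# Round-trip a fake weight matrix and inspect the reconstruction error.
w = np.random.randn(128, 64).astype(np.float32)
codes, scale, mn = quantize_4bit_groupwise(w)
w_hat = dequantize_4bit_groupwise(codes, scale, mn, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Because both the weights and the KV cache are stored in this reduced form, the memory footprint and the CPU/disk I/O traffic during offloading shrink accordingly, which is what makes the compression relevant to throughput rather than only to model size.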