Abstract: The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. In this paper, we study how to lower the resource requirements of LLM inference to a single commodity GPU while achieving practical performance. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for the best pattern to store and access tensors, including the weights, activations, and attention key/value (KV) cache. FlexGen further compresses both the weights and the KV cache to 4 bits with negligible accuracy loss. Compared with state-of-the-art offloading systems, FlexGen runs OPT-175B up to 100× faster on a single 16GB GPU and, for the first time, achieves a practical generation throughput of 1 token/s. FlexGen also comes with a pipeline parallelism runtime that allows super-linear scaling of decoding throughput when more distributed GPUs are available.
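For intuition on the 4-bit compression mentioned in the abstract, such schemes are typically group-wise quantization: each small group of values is rescaled to the 16 levels representable in 4 bits. The snippet below is a minimal NumPy sketch under that assumption; the group size, function names, and lack of bit-packing are illustrative choices, not FlexGen's actual implementation.

```python
import numpy as np

def quantize_4bit_groupwise(x: np.ndarray, group_size: int = 64):
    """Quantize a float tensor to 4-bit codes with per-group min/max scaling.

    Returns integer codes in [0, 15] plus the per-group scale and minimum
    needed to dequantize. Assumes x.size is divisible by group_size.
    """
    flat = x.astype(np.float32).reshape(-1, group_size)
    mn = flat.min(axis=1, keepdims=True)
    mx = flat.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0                      # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero
    codes = np.clip(np.round((flat - mn) / scale), 0, 15).astype(np.uint8)
    return codes, scale, mn

def dequantize_4bit_groupwise(codes, scale, mn, shape):
    """Recover an approximate float tensor from the 4-bit codes."""
    return (codes.astype(np.float32) * scale + mn).reshape(shape)

# Round-trip a fake weight matrix and inspect the reconstruction error.
w = np.random.randn(128, 64).astype(np.float32)
codes, scale, mn = quantize_4bit_groupwise(w)
w_hat = dequantize_4bit_groupwise(codes, scale, mn, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Because both the weights and the KV cache are stored in this reduced form, the memory footprint and the CPU/disk I/O traffic during offloading shrink accordingly, which is what makes the compression relevant to throughput rather than only to model size.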