Keywords: Heterogeneous System, LLM Serving, Throughput Optimization
TL;DR: A distributed serving system that exploits CPUs to boost the throughput of the LLM decoding phase.
Abstract: Serving long-sequence large language models (LLMs) is costly, yet the expensive and scarce GPUs are inefficient when generating tokens sequentially unless the batch of sequences is enlarged. The batch size, however, is limited by the KV-Cache: intermediate results that are reused at every decoding step and occupy too much memory to generate more and longer sequences simultaneously. While the KV-Cache could be offloaded to host memory, the limited CPU-GPU bandwidth then becomes an inevitable bottleneck.
We find a way to decompose transformer models into two parts with different characteristics, one of which contains the memory-bound KV-Cache accesses. Our key insight is that the aggregated memory capacity, bandwidth, and computing power of CPUs across multiple nodes make them an efficient option for processing this part. The performance improvement comes from reduced data-transmission overhead and the boosted GPU throughput on the other part of the model.
Evaluation results show that our system achieves 1.88x-5.04x the throughput of vLLM when serving modern LLMs with the same GPU.
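For illustration, here is a minimal sketch of the decomposition described above, under our own assumptions of a PyTorch-style single-layer, single-sequence setting (all names, shapes, and the einsum-based attention are hypothetical, not the paper's implementation): the compute-bound projections stay on the GPU, the memory-bound attention over the KV-Cache runs on the CPU, and only per-token activations cross the CPU-GPU link.

    # Sketch: split one decoder layer between GPU (compute-bound) and CPU
    # (memory-bound KV-Cache attention). Hypothetical names and shapes.
    import torch
    import torch.nn.functional as F

    D, H = 4096, 32          # hidden size, attention heads (hypothetical)
    HD = D // H              # per-head dimension
    gpu = "cuda" if torch.cuda.is_available() else "cpu"

    # Compute-bound weights live on the GPU.
    w_qkv = torch.randn(D, 3 * D, device=gpu) * 0.02
    w_out = torch.randn(D, D, device=gpu) * 0.02

    # The memory-bound KV-Cache stays in host memory: [seq, heads, head_dim].
    k_cache = torch.empty(0, H, HD)
    v_cache = torch.empty(0, H, HD)

    def decode_step(x: torch.Tensor) -> torch.Tensor:
        """One decoding step for a single token embedding x of shape [D]."""
        global k_cache, v_cache
        # 1) GPU part: dense QKV projection (compute-bound).
        q, k, v = (x.to(gpu) @ w_qkv).split(D)
        # 2) Ship only the small per-token q/k/v to the CPU, not the cache.
        q, k, v = (t.cpu().view(H, HD) for t in (q, k, v))
        k_cache = torch.cat([k_cache, k[None]], dim=0)
        v_cache = torch.cat([v_cache, v[None]], dim=0)
        # 3) CPU part: attention over the full cache (memory-bound).
        scores = torch.einsum("hd,shd->hs", q, k_cache) / HD**0.5
        attn = torch.einsum("hs,shd->hd", F.softmax(scores, dim=-1), v_cache)
        # 4) Back to the GPU for the output projection (and, in a full
        #    model, the MLP and remaining layers).
        return attn.reshape(D).to(gpu) @ w_out

    out = decode_step(torch.randn(D))
    print(out.shape)  # torch.Size([4096])

The point of the split is the traffic pattern: the cache, which grows with sequence length and batch size, never moves, while only O(D) bytes per token cross the interconnect in each direction.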
Submission Number: 42