Abstract: Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative
language models, the inference process involves two primary phases: prompt processing and token generation.
Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix
multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth
due to the overhead of transferring weights and KV cache values from the memory system to the computing units.
This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive
text generation, both of which are increasingly crucial for LLMs.
This paper introduces “Keyformer”, an innovative inference-time approach to mitigating the challenges associated
with KV cache size. Keyformer leverages the observation that approximately 90% of the attention weight in
generative inference focuses on a specific subset of tokens, referred to as “key” tokens. Keyformer retains only
the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach
reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We
evaluate Keyformer’s performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which
employ various positional embedding algorithms. Our assessment uses a variety of tasks, with an emphasis on
summarization and conversation tasks involving extended contexts. We show that Keyformer reduces inference
latency by 2.1× and improves token generation throughput by 2.4×, while preserving the model’s accuracy.
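To make the core idea concrete, the sketch below prunes a per-head KV cache down to a fixed budget of retained tokens. It is an illustration under assumptions only: the score used here (cumulative attention weight received by each cached token) and the helper `prune_kv_cache` are hypothetical stand-ins, not Keyformer’s novel score function, which the paper defines separately.

```python
# Minimal sketch of KV-cache pruning by a token-importance score.
# NOTE: the score below (summed attention weight per cached token) is an
# assumed stand-in for illustration, not Keyformer's actual score function.
import numpy as np

def prune_kv_cache(keys, values, attn_weights, keep: int):
    """Keep only the `keep` highest-scoring tokens in the KV cache.

    keys, values : (seq_len, head_dim) cached K/V for one attention head
    attn_weights : (num_queries, seq_len) attention weights from recent steps
    keep         : number of "key" tokens to retain
    """
    # Score each cached token by the total attention it has received.
    scores = attn_weights.sum(axis=0)                  # (seq_len,)
    keep_idx = np.sort(np.argsort(scores)[-keep:])     # preserve positional order
    return keys[keep_idx], values[keep_idx], keep_idx

# Toy usage: a 16-token cache pruned down to 4 retained "key" tokens.
rng = np.random.default_rng(0)
seq_len, head_dim = 16, 8
K = rng.standard_normal((seq_len, head_dim))
V = rng.standard_normal((seq_len, head_dim))
A = rng.random((4, seq_len))
A /= A.sum(axis=1, keepdims=True)                      # rows sum to 1, like softmax
K_small, V_small, kept = prune_kv_cache(K, V, A, keep=4)
print("retained token positions:", kept)
```

The point of the sketch is the memory effect: after pruning, subsequent decoding steps attend over a cache whose size is bounded by the retained-token budget rather than by the full context length.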