CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Model

Published: 21 Jun 2024, Last Modified: 26 Jul 2024 | ES-FoMo-II 2024 Poster | CC BY 4.0
Keywords: KV cache replacement, LLM, Inference
TL;DR: We propose CO2, a novel KV cache replacement method that significantly reduces memory usage and increases token generation speed in LLMs, improving throughput by up to 1.44x without compromising output quality.
Abstract: The widespread adoption of Large Language Models (LLMs) such as ChatGPT has highlighted significant challenges in inference cost management due to their autoregressive nature, which requires sequential token generation. The KV cache was introduced to avoid recomputation during inference, but at the expense of increased GPU memory usage, especially as input and output lengths grow. We introduce the Cumulative Observation Oracle (CO2), a novel approach that optimizes KV cache replacement through a refined scoring system. Our method combines an extended observation period, a decay mechanism for attention scores, and an adaptive FIFO cache size adjustment to manage cache space efficiently and reduce overall memory demands. Evaluation on OPT-6.7B and Llama2-7B demonstrates that CO2 significantly reduces memory usage while maintaining output quality, yielding 1.44x and 1.32x higher token generation throughput on OPT-6.7B and Llama2-7B, respectively.
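
To make the high-level description concrete, the sketch below shows one plausible way such a score-driven replacement policy could be organized: cached entries accumulate exponentially decayed attention scores over an observation window, the most recent tokens sit in a protected FIFO region, and the lowest-scoring older entry is evicted once the cache budget is exceeded. This is a minimal illustrative sketch, not the authors' implementation; the class and parameter names (`DecayedAttentionKVCache`, `budget`, `fifo_size`, `decay`) and the exact update rule are assumptions.

```python
class DecayedAttentionKVCache:
    """Minimal sketch of attention-score-driven KV cache eviction.

    Recent tokens occupy a protected FIFO region; older entries compete
    on cumulative, exponentially decayed attention scores. Assumes
    budget > fifo_size so eviction candidates always exist.
    """

    def __init__(self, budget: int, fifo_size: int, decay: float = 0.99):
        self.budget = budget        # max number of cached tokens
        self.fifo_size = fifo_size  # recent tokens protected from eviction
        self.decay = decay          # exponential decay on accumulated scores
        self.positions: list[int] = []      # cached token positions, insertion order
        self.scores: dict[int, float] = {}  # position -> cumulative decayed score

    def observe(self, attn_weights: dict[int, float]) -> None:
        """Fold the current decoding step's attention weights into the scores.

        attn_weights maps a cached token position to the attention
        probability the newly generated token assigned to it.
        """
        for pos in self.positions:
            self.scores[pos] = self.decay * self.scores[pos] + attn_weights.get(pos, 0.0)

    def insert(self, pos: int) -> None:
        """Add the new token's KV entry, evicting the weakest old entry if over budget."""
        self.positions.append(pos)
        self.scores[pos] = 0.0
        if len(self.positions) > self.budget:
            # Only tokens outside the recent FIFO window are eviction candidates.
            candidates = self.positions[:-self.fifo_size]
            victim = min(candidates, key=lambda p: self.scores[p])
            self.positions.remove(victim)
            del self.scores[victim]
```

A production implementation would of course operate on per-head key/value tensors on the GPU rather than Python dictionaries, but the control flow (observe attention, decay scores, protect recent tokens, evict the lowest-scoring entry) captures the kind of policy the abstract describes.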
Submission Number: 47