Keywords: Efficient LLM Inference, KV Cache, Bandwidth Efficient, Memory Efficient Serving, Offloading.
TL;DR: We present a novel approach named MAPLE to manage the KV cache for LLMs, which effectively addresses efficiency and accuracy concerns by predicting and loading only the important KV pairs from the KV cache.
Abstract: Large Language Models (LLMs) perform well across various natural language processing (NLP) tasks. However, inference for long text generation faces challenges due to the significant memory demands of the key-value (KV) cache, which scales with sequence length. In this paper, we introduce a novel, bandwidth-efficient method for managing the KV cache. Using learning-based techniques, our method predicts and retrieves only the essential KV entries, thereby eliminating the need to transfer the full set of KV pairs. Distinct from previous approaches, our method decouples the prediction phase from the computation phase by storing low-rank Keys in HBM, drastically reducing bandwidth consumption while minimally impacting memory usage and maintaining accuracy.
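The sketch below is a rough, illustrative rendering of the mechanism described in the abstract, not the paper's actual implementation: a low-rank projection of the Keys stays in GPU memory (HBM), the current query is scored against it to predict which cached positions matter, and only those KV pairs are fetched from an offloaded store. The class name, the random projection, and the parameters (`rank`, `top_k`) are assumptions made for illustration; the paper's learned predictor would replace the random projection.

```python
import torch


class LowRankKVSelector:
    """Keeps only a low-rank sketch of Keys on the GPU and predicts which
    cached KV pairs are worth transferring (illustrative sketch)."""

    def __init__(self, head_dim: int, rank: int = 16, top_k: int = 64):
        # Random projection used purely for illustration; a learned,
        # low-rank key representation would take its place.
        self.proj = torch.randn(head_dim, rank) / rank ** 0.5
        self.top_k = top_k
        self.low_rank_keys = []  # stays in HBM (GPU memory)

    def append(self, key: torch.Tensor) -> None:
        # key: (head_dim,) -- store only its low-rank sketch on the GPU.
        self.low_rank_keys.append(key @ self.proj)

    def select(self, query: torch.Tensor) -> torch.Tensor:
        # Score every cached position with the low-rank keys and return
        # indices of the top_k entries to fetch from the offloaded cache.
        lk = torch.stack(self.low_rank_keys)   # (seq_len, rank)
        scores = lk @ (query @ self.proj)      # (seq_len,)
        k = min(self.top_k, scores.numel())
        return torch.topk(scores, k).indices


# Usage: transfer only the selected KV pairs instead of the full cache.
selector = LowRankKVSelector(head_dim=128)
full_kv_cache = []  # offloaded (e.g., CPU-resident) full KV storage
for _ in range(256):
    k_t, v_t = torch.randn(128), torch.randn(128)
    full_kv_cache.append((k_t, v_t))  # full KV pair stays off-GPU
    selector.append(k_t)              # only the low-rank key stays in HBM

query = torch.randn(128)
idx = selector.select(query)
needed = [full_kv_cache[i] for i in idx.tolist()]  # fetch only these entries
```

The point of the decoupling is that the prediction step touches only the small low-rank keys resident in HBM, so the expensive off-device transfer is limited to the handful of KV pairs actually needed for the current step.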
Submission Number: 75