Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Abstract: Scaling the input context length of a large language model (LLM) incurs a significant increase in computation cost and memory footprint to maintain the attention key-value (KV) cache.
Existing KV cache compression methods suffer from inefficient compression strategies and limited memory reduction, making it difficult for LLMs to perform long-context inference on consumer-grade devices, especially when processing long streaming inputs.
Such obstacles prevent consumer-grade devices from supporting more complex applications, creating challenges for the democratization of LLMs.
To overcome this, we propose Locret, a framework to create an eviction policy compatible with chunked prefill. By evaluating the causal importance of KV cache units using \textit{retaining heads}, Locret enables precise eviction of cache units, facilitating efficient long-context inference.
In our empirical studies, Locret outperforms recent popular and competitive approaches in both memory efficiency and generation quality: it achieves a KV cache compression ratio of up to $20\times$ with less than $10\%$ performance loss.
Furthermore, Locret supports 128K+ long-context inference on a single NVIDIA 4090 GPU without compromising generation quality, requiring less than 1 GPU hour of additional training.
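To make the abstract's mechanism concrete, below is a minimal sketch of retaining-head-guided KV cache eviction during chunked prefill. All names, shapes, chunk sizes, and the scoring-head architecture here are illustrative assumptions, not the authors' implementation; it only demonstrates the general idea of scoring cache units and keeping the top-scoring ones within a fixed budget.

```python
# Minimal sketch (NOT the authors' code): per-head KV eviction guided by a
# small "retaining head". The MLP design, shapes, and budget are assumptions.
import torch
import torch.nn as nn


class RetainingHead(nn.Module):
    """Tiny MLP that predicts an importance score for each cached KV unit."""

    def __init__(self, head_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * head_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # k, v: [num_heads, seq_len, head_dim] -> scores: [num_heads, seq_len]
        return self.mlp(torch.cat([k, v], dim=-1)).squeeze(-1)


@torch.no_grad()
def evict_kv(k_cache, v_cache, retaining_head, budget: int):
    """Keep only the `budget` highest-scoring KV units per attention head."""
    scores = retaining_head(k_cache, v_cache)                  # [H, S]
    keep = scores.topk(min(budget, scores.shape[-1]), dim=-1).indices
    keep = keep.sort(dim=-1).values                            # preserve order
    idx = keep.unsqueeze(-1).expand(-1, -1, k_cache.shape[-1])
    return k_cache.gather(1, idx), v_cache.gather(1, idx)


# During chunked prefill, eviction runs after each chunk so the cache size
# never exceeds the budget, bounding peak memory.
H, D, budget = 8, 128, 1024
head = RetainingHead(D)
k_cache = v_cache = torch.empty(H, 0, D)
for chunk_len in (512, 512, 512):                              # toy chunk sizes
    k_new, v_new = torch.randn(H, chunk_len, D), torch.randn(H, chunk_len, D)
    k_cache = torch.cat([k_cache, k_new], dim=1)
    v_cache = torch.cat([v_cache, v_new], dim=1)
    k_cache, v_cache = evict_kv(k_cache, v_cache, head, budget)
```

In this sketch the retaining head would be trained offline (the abstract notes under 1 GPU hour of additional training) so that its scores approximate the causal importance of each cache unit; at inference time, eviction is a cheap top-k selection per chunk.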
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hugo_Touvron1
Submission Number: 5682