Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Abstract: Scaling the input context length of a large language model (LLM) incurs a significant increase in computation cost and memory footprint to maintain the attention key-value (KV) cache.
Existing KV cache compression methods suffer from inefficient compression strategies and limited memory reduction, making it difficult for LLMs to perform long-context inference on consumer-grade devices, especially when processing long streaming inputs.
Such obstacles prevent consumer-grade devices from supporting more complex applications, creating challenges for the democratization of LLMs.
To overcome this, we propose Locret, a framework to create an eviction policy compatible with chunked prefill. By evaluating the causal importance of KV cache units using \textit{retaining heads}, Locret enables precise eviction of cache units, facilitating efficient long-context inference.
In our empirical studies, Locret outperforms recent popular and competitive approaches in both memory efficiency and generation quality: it achieves a KV cache compression ratio of up to $20\times$ with less than $10\%$ performance loss.
Furthermore, Locret supports 128K+ long-context inference on a single NVIDIA 4090 GPU without compromising generation quality, requiring less than 1 GPU hour of additional training.
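To make the abstract's mechanism concrete, below is a minimal sketch of retaining-head-guided KV cache eviction during chunked prefill. All names, shapes, chunk sizes, and the scoring-head architecture here are illustrative assumptions, not the authors' implementation; it only demonstrates the general idea of scoring cache units and keeping the top-scoring ones within a fixed budget.

```python
# Minimal sketch (NOT the authors' code): per-head KV eviction guided by a
# small "retaining head". The MLP design, shapes, and budget are assumptions.
import torch
import torch.nn as nn


class RetainingHead(nn.Module):
    """Tiny MLP that predicts an importance score for each cached KV unit."""

    def __init__(self, head_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * head_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # k, v: [num_heads, seq_len, head_dim] -> scores: [num_heads, seq_len]
        return self.mlp(torch.cat([k, v], dim=-1)).squeeze(-1)


@torch.no_grad()
def evict_kv(k_cache, v_cache, retaining_head, budget: int):
    """Keep only the `budget` highest-scoring KV units per attention head."""
    scores = retaining_head(k_cache, v_cache)                  # [H, S]
    keep = scores.topk(min(budget, scores.shape[-1]), dim=-1).indices
    keep = keep.sort(dim=-1).values                            # preserve order
    idx = keep.unsqueeze(-1).expand(-1, -1, k_cache.shape[-1])
    return k_cache.gather(1, idx), v_cache.gather(1, idx)


# During chunked prefill, eviction runs after each chunk so the cache size
# never exceeds the budget, bounding peak memory.
H, D, budget = 8, 128, 1024
head = RetainingHead(D)
k_cache = v_cache = torch.empty(H, 0, D)
for chunk_len in (512, 512, 512):                              # toy chunk sizes
    k_new, v_new = torch.randn(H, chunk_len, D), torch.randn(H, chunk_len, D)
    k_cache = torch.cat([k_cache, k_new], dim=1)
    v_cache = torch.cat([v_cache, v_new], dim=1)
    k_cache, v_cache = evict_kv(k_cache, v_cache, head, budget)
```

In this sketch the retaining head would be trained offline (the abstract notes under 1 GPU hour of additional training) so that its scores approximate the causal importance of each cache unit; at inference time, eviction is a cheap top-k selection per chunk.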
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hugo_Touvron1
Submission Number: 5682