Task-Aware Dynamic KV Cache Eviction via Error-Driven Importance Estimation for Efficient LLM Inference

ACL ARR 2025 May Submission886 Authors

15 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: The increasing context length of Large Language Models (LLMs) introduces significant memory overhead due to the rapid growth of the Key-Value (KV) cache. Recent work has explored KV cache eviction as a way to reduce this footprint. However, existing methods typically rely on simplistic estimates of KV cache importance and apply rigid cache allocation strategies across layers, resulting in notable performance degradation. To address these limitations, we propose an Error-Driven Importance Estimation (EDIE) method that rigorously quantifies token criticality based on the error each token induces in the attention output, and, building on it, a Task-Aware Dynamic Allocation (TADA) mechanism that allocates layer-specific KV cache capacity according to task complexity and layer importance. Experiments show consistent accuracy gains on LongBench tasks, surpassing prior methods across cache budgets. Notably, on the Needle-in-a-Haystack task, our method achieves up to a 15.7% absolute accuracy gain under extreme cache constraints.
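
The sketch below illustrates one plausible reading of "error-driven importance" for KV cache eviction: score each cached token by the error introduced in the single-head attention output when that token's KV pair is removed, then keep only the highest-scoring pairs. It is an illustrative assumption, not the authors' EDIE implementation; the function names (`attention_output`, `error_driven_importance`, `evict`) and the `budget` parameter are hypothetical.

```python
# Minimal sketch of leave-one-out, error-driven KV importance scoring.
# Not the paper's method; an assumed interpretation for illustration only.
import numpy as np

def attention_output(q, K, V):
    """Single-query scaled dot-product attention for one head."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)            # (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # (d,)

def error_driven_importance(q, K, V):
    """Importance of token i = L2 error in the attention output when its KV pair is dropped."""
    full = attention_output(q, K, V)
    n = K.shape[0]
    imp = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        imp[i] = np.linalg.norm(full - attention_output(q, K[keep], V[keep]))
    return imp

def evict(q, K, V, budget):
    """Keep the `budget` most important KV pairs, preserving positional order."""
    imp = error_driven_importance(q, K, V)
    keep = np.sort(np.argsort(imp)[-budget:])
    return K[keep], V[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, budget = 16, 64, 8
    K, V, q = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=d)
    K_small, V_small = evict(q, K, V, budget)
    print(K_small.shape, V_small.shape)    # (8, 64) (8, 64)
```

In this toy version the per-token `budget` is fixed; the abstract's TADA mechanism would instead vary that budget per layer according to task complexity and layer importance.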
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings, retrieval, Inference methods
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Keywords: LLM Efficiency, NLP in resource-constrained settings, retrieval, Inference methods
Submission Number: 886