Abstract: Integrating external knowledge into Large Language Models (LLMs) is vital, yet methods like Retrieval-Augmented Generation (RAG) struggle with broad queries that require synthesizing information from multiple sources, while feeding entire corpora to long-context models is computationally prohibitive.
We propose Task-Aware Key-Value Cache Compression (CacheNotes), a novel query-agnostic framework that builds a compact, reusable cache tailored to a specific task. Unlike prior approaches, we first generate a task-specific ‘cheat-sheet’ summary that guides a one-time compression of the corpus into a reusable KV-cache. The LLM can then efficiently answer diverse, reasoning-intensive queries from this compressed cache, eliminating the need for repeated retrieval or context expansion.
Experiments on LongBench show that CacheNotes outperforms standard RAG by up to 4 F1 points at a 20x compression rate, and delivers up to 4x lower latency, while remaining competitive with state-of-the-art query-aware baselines. Additional results on real-world enterprise and synthetic datasets demonstrate that CacheNotes is especially effective for multi-hop and broad-coverage queries.
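To make the two-stage design described above concrete, here is a minimal, hypothetical sketch assuming a HuggingFace causal LM and the legacy tuple KV-cache layout; summarize_task() and the keep-first-k compress_kv_cache() rule are illustrative placeholders for the cheat-sheet generation and task-aware compression, not the paper's released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model; any decoder-only causal LM works the same way
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def summarize_task(corpus: str, task: str) -> str:
    # Placeholder for the task-specific 'cheat-sheet'; the paper derives this with an LLM.
    return f"Task description: {task}\nPreserve the corpus content needed for this task."

def compress_kv_cache(past_key_values, keep: int):
    # Toy compression: keep only the first `keep` positions of every layer's keys/values.
    # CacheNotes instead selects what to keep guided by the cheat-sheet summary.
    return tuple((k[:, :, :keep, :], v[:, :, :keep, :]) for k, v in past_key_values)

def build_cache(corpus: str, task: str, keep: int = 256):
    """One-time, query-agnostic step: prefill the corpus once, then shrink its KV-cache."""
    cheat_sheet = summarize_task(corpus, task)
    ids = tok(cheat_sheet + "\n" + corpus, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    return compress_kv_cache(out.past_key_values, keep)

def answer(query: str, compact_cache, max_new_tokens: int = 32) -> str:
    """Per-query step: reuse the compact cache instead of retrieving or re-prefilling."""
    past = compact_cache
    ids = tok(f"\nQuestion: {query}\nAnswer:", return_tensors="pt").input_ids
    new_tokens = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            past_len = past[0][0].shape[2]  # length of the cached prefix
            mask = torch.ones(1, past_len + ids.shape[1], dtype=torch.long)
            out = model(input_ids=ids, attention_mask=mask,
                        past_key_values=past, use_cache=True)
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            new_tokens.append(next_id.item())
            past, ids = out.past_key_values, next_id
    return tok.decode(new_tokens)
```

The point of the sketch is the amortization: build_cache() runs once per corpus and task, while answer() serves arbitrarily many queries against the same compact cache.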
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, Retrieval-augmented generation, Multihop QA, Document representation, Reasoning, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6200