Abstract: Integrating external knowledge into Large Language Models (LLMs) is vital, yet methods like Retrieval-Augmented Generation (RAG) struggle with broad queries that require synthesizing information from multiple sources, while feeding entire corpora to long-context models is computationally prohibitive.
We propose Task-Aware Key-Value Cache Compression (CacheNotes), a novel query-agnostic framework that builds a compact, reusable cache tailored to a specific task. Unlike prior approaches, we first generate a task-specific ‘cheat-sheet’ summary that guides a one-time compression of the corpus into a reusable KV-cache. The LLM can then efficiently answer diverse, reasoning-intensive queries from this compressed cache, eliminating the need for repeated retrieval or context expansion.
Experiments on LongBench show that CacheNotes outperforms standard RAG by up to 4 F1 points at a 20x compression rate, and delivers up to 4x lower latency, while remaining competitive with state-of-the-art query-aware baselines. Additional results on real-world enterprise and synthetic datasets demonstrate that CacheNotes is especially effective for multi-hop and broad-coverage queries.
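To make the two-stage design described above concrete, here is a minimal, hypothetical sketch assuming a HuggingFace causal LM and the legacy tuple KV-cache layout; summarize_task() and the keep-first-k compress_kv_cache() rule are illustrative placeholders for the cheat-sheet generation and task-aware compression, not the paper's released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model; any decoder-only causal LM works the same way
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def summarize_task(corpus: str, task: str) -> str:
    # Placeholder for the task-specific 'cheat-sheet'; the paper derives this with an LLM.
    return f"Task description: {task}\nPreserve the corpus content needed for this task."

def compress_kv_cache(past_key_values, keep: int):
    # Toy compression: keep only the first `keep` positions of every layer's keys/values.
    # CacheNotes instead selects what to keep guided by the cheat-sheet summary.
    return tuple((k[:, :, :keep, :], v[:, :, :keep, :]) for k, v in past_key_values)

def build_cache(corpus: str, task: str, keep: int = 256):
    """One-time, query-agnostic step: prefill the corpus once, then shrink its KV-cache."""
    cheat_sheet = summarize_task(corpus, task)
    ids = tok(cheat_sheet + "\n" + corpus, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    return compress_kv_cache(out.past_key_values, keep)

def answer(query: str, compact_cache, max_new_tokens: int = 32) -> str:
    """Per-query step: reuse the compact cache instead of retrieving or re-prefilling."""
    past = compact_cache
    ids = tok(f"\nQuestion: {query}\nAnswer:", return_tensors="pt").input_ids
    new_tokens = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            past_len = past[0][0].shape[2]  # length of the cached prefix
            mask = torch.ones(1, past_len + ids.shape[1], dtype=torch.long)
            out = model(input_ids=ids, attention_mask=mask,
                        past_key_values=past, use_cache=True)
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            new_tokens.append(next_id.item())
            past, ids = out.past_key_values, next_id
    return tok.decode(new_tokens)
```

The point of the sketch is the amortization: build_cache() runs once per corpus and task, while answer() serves arbitrarily many queries against the same compact cache.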
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, Retrieval-augmented generation, Multihop QA, Document representation, Reasoning, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6200