KV Cache as a Reasoning Primitive for Long Context Reasoning

Published: 05 Mar 2026, Last Modified: 25 Apr 2026, Venue: ICLR 2026 Workshop LLM Reasoning, Readers: Everyone, License: CC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: Machine Learning, Caching
Abstract: Large language models often produce inconsistent answers across multiple related questions when earlier premises are partially forgotten or distorted in long contexts. We argue this is not only a modeling issue but a working-memory issue: KV cache policy controls which premises remain accessible for attention and thus mediates logical consistency under finite memory. Current practice sits at two extremes: retain everything (wasteful) or evict uniformly (premise-destructive). This ignores decades of memory-hierarchy results on working sets and locality. We synthesize empirical evidence that attention working sets are sparse and structurally constrained (heavy hitters, attention sinks, layer heterogeneity), implying that premise-preserving retention is achievable. We provide a small proof-of-concept cache manager with content-aware retention and show favorable memory–quality tradeoffs on a premise-retrieval stress test (passkey retrieval). We then propose a “consistency bundle” evaluation protocol for measuring cross-question contradictions as a function of memory policy. Our conclusion is practical: memory policies should be designed and reported as reasoning controls, not just serving optimizations.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 150