Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

Wenjie Du; Li Jiang; Keda TAO; Xue Liu; Huan Wang

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0

TL;DR: We use reinforcement learning as a probe to identify the small subset of attention heads essential for chain-of-thought reasoning, then keep only those at full KV cache and aggressively compress the rest during inference.

Abstract: Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation, and these behaviors are highly fragile to information loss during decoding, which creates critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. No existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing the others with a constant-size KV cache. Experiments reveal that only a fraction of heads is essential for reasoning, enabling 20–60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
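The compression strategy in the abstract (full KV cache for reasoning-critical heads, constant-size KV cache for the rest) can be illustrated with a minimal sketch. This is not the RLKV implementation: the function names, the head counts, and the eviction policy for compressed heads (keep a few initial "sink" positions plus a recent window) are assumptions made here for illustration only; the abstract states only that compressed heads use a constant-size cache.

```python
def evict_to_budget(cached_positions, num_sink, num_recent):
    """Trim one compressed head's cache to a constant size.

    Assumed policy (not from the paper): keep the first `num_sink`
    positions and the last `num_recent` positions, drop the middle.
    """
    budget = num_sink + num_recent
    if len(cached_positions) <= budget:
        return list(cached_positions)
    head = list(cached_positions[:num_sink])
    # Index from the front so that num_recent == 0 yields an empty tail.
    tail = list(cached_positions[len(cached_positions) - num_recent:])
    return head + tail


def cached_token_slots(seq_len, num_heads, num_full_heads, compressed_budget):
    """Total cached token slots when only some heads keep the full cache."""
    num_compressed = num_heads - num_full_heads
    full_slots = num_full_heads * seq_len
    # A compressed head never stores more than its constant budget.
    compressed_slots = num_compressed * min(seq_len, compressed_budget)
    return full_slots + compressed_slots


def cache_reduction(seq_len, num_heads, num_full_heads, compressed_budget):
    """Fraction of KV cache saved relative to keeping every head in full."""
    baseline = num_heads * seq_len
    mixed = cached_token_slots(seq_len, num_heads, num_full_heads, compressed_budget)
    return 1.0 - mixed / baseline


if __name__ == "__main__":
    # Hypothetical numbers: 32 heads, 16 kept in full, 8192 generated tokens,
    # and a 256-token budget for each compressed head.
    print(evict_to_budget(list(range(10)), num_sink=2, num_recent=3))
    print(cache_reduction(8192, 32, 16, 256))
```

Under these hypothetical settings, the saving approaches the fraction of compressed heads as the generated sequence grows, because the compressed heads' cost stays constant while the full heads' cost grows linearly with sequence length.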

Lay Summary: **Problem.** AI systems like DeepSeek-R1 and OpenAI o1 solve hard math and coding problems by "thinking out loud" — generating long chains of reasoning before answering. To follow its own train of thought, the model keeps everything it has written in a memory buffer that grows linearly with the response, and for long chains this memory dominates the cost of running the model. Existing tricks to shrink that memory work fine for short replies but break reasoning: the model forgets crucial steps and either falls into repetitive loops or never reaches an answer. **Solution.** A model's memory is divided across many small "attention heads," each tracking a different aspect of the input, and we hypothesized that only a small fraction actually carry the reasoning thread. To find them, we used reinforcement learning as a probe: we let the model try compressing each head and rewarded it whenever the final answer was still correct. The heads whose compression hurt the answer most are the ones essential for reasoning. **Impact.** Our method, RLKV, keeps full memory only for these critical heads and aggressively compresses the rest, cutting memory by 20–60% with near-lossless accuracy and delivering up to a 2× speedup on real serving engines. Beyond efficiency, it tells us which parts of a model actually do the reasoning — a small step toward understanding how these systems think.

Link To Code: https://github.com/kurt232/RLKV

Primary Area: Deep Learning->Large Language Models

Keywords: Reasoning LLM, KV Cache Compression

Submission Number: 2422