Keywords: Large Language Models, Pre-Softmax Analysis, Attention Logits, Relative Contextualization, KV-Compression, Attribution
TL;DR: RCStat leverages raw, pre-Softmax attention logits to quantify token importance at multiple granularities, enabling adaptive KV-cache eviction and high-fidelity attributions, all without any model retraining.
Abstract: Estimating the importance of input tokens and their activations in auto-regressive models is a fundamental requirement in many applications, such as key-value (KV) cache compression and attribution. Prior work computes token importance using attention weights, which are obtained by normalizing the raw attention logits (query-key inner products) with a softmax operation. However, the softmax normalization suppresses the rich information within the attention logits. We introduce RCStat, a statistical framework that harnesses the raw attention logits via Relative Contextualization (RC) -- a random variable measuring contextual influence from one subset of tokens to another. We derive computationally efficient bounds on the expected RC and demonstrate its utility in two applications: (i) KV compression, where RC-based adaptive thresholding evicts substantial portions of the KV cache with minimal quality loss in token generation; and (ii) attribution, where attention heads with high expected RC yield accurate span-level attribution. Across QA, summarization, and attribution benchmarks, RCStat achieves state-of-the-art performance, improving generation quality by 15-40% and attribution accuracy by 2-16%, all without any model retraining.
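To make the core idea concrete, below is a minimal sketch of how raw, pre-softmax attention logits (query-key inner products) can be aggregated into a per-token contextualization score and used for adaptive KV eviction. This is not the paper's method: the mean-over-queries aggregation, the quantile threshold, and the helper names (attention_logits, rc_score, evict_kv, keep_ratio) are illustrative assumptions standing in for RCStat's derived expected-RC bounds.

```python
# Illustrative sketch only: aggregates raw pre-softmax attention logits into a
# per-token importance score and evicts low-scoring KV entries. The aggregation
# and threshold here are assumptions, not RCStat's expected-RC bounds.
import numpy as np

def attention_logits(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Raw pre-softmax attention logits for one head: shape (n_queries, n_keys)."""
    d = Q.shape[-1]
    return Q @ K.T / np.sqrt(d)

def rc_score(logits: np.ndarray, query_idx: np.ndarray, key_idx: np.ndarray) -> float:
    """Illustrative proxy for expected RC: mean raw logit from a query subset to a key subset."""
    return float(logits[np.ix_(query_idx, key_idx)].mean())

def evict_kv(logits: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep keys whose per-token score (mean logit over queries) clears an
    adaptive (quantile-based) threshold; returns indices of retained KV entries."""
    per_key = logits.mean(axis=0)                    # score each cached key
    thresh = np.quantile(per_key, 1.0 - keep_ratio)  # adaptive threshold
    return np.nonzero(per_key >= thresh)[0]

# Toy usage: 8 queries, 16 cached keys, head dimension 64.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(8, 64)), rng.normal(size=(16, 64))
L = attention_logits(Q, K)
print("retained KV indices:", evict_kv(L, keep_ratio=0.25))
```

The point of the sketch is only that the score is computed before the softmax, so relative magnitudes of the logits are preserved rather than flattened by normalization.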
Supplementary Material: zip
Primary Area: generative models
Submission Number: 13660