Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: LLMs' attention layers exhibit concentrated massive values in Q and K (but not V), induced by RoPE; these values prove crucial for contextual knowledge understanding rather than parametric knowledge retrieval.
Abstract: Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show for the first time that concentrated massive values consistently emerge in specific regions of the attention queries (Q) and keys (K), but not in the values (V), across various modern transformer-based LLMs. Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, consistent with our analysis. Finally, we trace the emergence of the concentrated massive values and find that the concentration is caused by Rotary Positional Encoding (RoPE) and appears from the very first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://github.com/MingyuJ666/Rope_with_LLM.
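To make the phenomenon concrete, below is a minimal sketch (not the authors' released code; see the linked repository for that) of how one might probe Q, K, and V activations for concentrated massive values in a RoPE-based LLM. The model name, prompt, hook placement, and concentration score are illustrative assumptions; the hooks capture the linear projection outputs (i.e., pre-RoPE Q and K), so applying the rotary embedding before measuring would reflect the paper's finding more directly.

```python
# Sketch: compare per-dimension activation magnitudes of Q, K, and V
# in the first attention layer of a RoPE-based causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any RoPE-based causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # output: (batch, seq_len, hidden); note this is the projection output,
        # i.e. Q/K before the rotary embedding is applied.
        captured[name] = output.detach().float()
    return hook

layer0 = model.model.layers[0].self_attn
handles = [
    layer0.q_proj.register_forward_hook(make_hook("Q")),
    layer0.k_proj.register_forward_hook(make_hook("K")),
    layer0.v_proj.register_forward_hook(make_hook("V")),
]

with torch.no_grad():
    ids = tok("The capital of France is", return_tensors="pt")
    model(**ids)

for h in handles:
    h.remove()

# Concentrated "massive values" should show up as a few hidden dimensions
# whose magnitude dwarfs the rest in Q and K, but not in V.
for name, act in captured.items():
    per_dim = act.abs().mean(dim=(0, 1))        # mean |activation| per hidden dim
    ratio = per_dim.max() / per_dim.median()    # rough concentration score (assumed metric)
    print(f"{name}: max/median per-dimension magnitude = {ratio:.1f}")
```

Running the same probe across layers, or after applying RoPE to the captured Q and K, is one way to reproduce the layer-wise emergence pattern the abstract describes.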
Lay Summary:
Problem: Large language models (LLMs) like ChatGPT understand context through attention mechanisms, but it remains unclear why certain attention values become extremely large and are highly concentrated in specific dimensions.
Solution: We discovered, for the first time, that these concentrated massive values occur exclusively in the query (Q) and key (K) vectors of the attention module, not in the value (V) vectors. Through extensive experiments and quantization analysis, we uncovered their origins and mechanisms.
Impact: We demonstrate that these massive values are critical for models to interpret contextual knowledge (information from the current input) rather than to retrieve parametric memory (information stored within the model). This provides new insights into optimizing attention mechanisms in LLMs.
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Massive Value, LLM, Contextual Knowledge Understanding
Submission Number: 521