Geometric Analysis of Token Selection in Multi-Head Attention

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM, separability, explainability, context increasing, geometry
Abstract: We present a geometric framework for analyzing multi-head attention in large language models (LLMs). Instead of aggregating over all tokens, we propose a top-$N$ selection mechanism that retains only the most attended tokens and study its behavior directly in the value-state space. We introduce novel geometric metrics -- Precision, Recall, and F-score -- to quantify the separability of selected versus non-selected tokens, and derive dimension- and margin-dependent bounds under empirically motivated assumptions on norm stability, similarity decay, and multi-phase attention distributions. Our theoretical results clarify how head specialization, sequence length, and the sink token jointly shape the geometry of attention. Empirical evaluation on several open-source LLMs (LLaMA-2-7B, Gemma-7B, and Mistral-7B) confirms our predictions: top-$N$ selection sharpens token separability, the sink token systematically correlates with Recall, and different heads specialize into local versus global regimes. These findings demonstrate that attention is not only a weighting mechanism but also a structured geometric classifier. Our framework provides measurable criteria for token selection, offers interpretability into head-level behavior, and opens new directions for designing sparse and geometry-aware attention mechanisms in LLMs.
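The abstract's top-$N$ mechanism and set-level Precision/Recall/F-score can be sketched in a toy form. Note the paper's actual metrics are geometric quantities defined in the value-state space; the set-based versions below, and all function names and the reference set, are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def top_n_selection(scores, n):
    """Return the indices of the n most-attended tokens for one head.
    scores: 1-D array of attention weights over the sequence."""
    return set(np.argsort(scores)[-n:].tolist())

def precision_recall_f(selected, relevant):
    """Toy set-based Precision/Recall/F-score between the selected
    tokens and a hypothetical reference set of 'relevant' tokens."""
    tp = len(selected & relevant)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Example: softmax attention over an 8-token sequence
logits = np.array([2.0, 0.1, 1.5, -1.0, 0.3, 1.8, -0.5, 0.2])
weights = np.exp(logits) / np.exp(logits).sum()
selected = top_n_selection(weights, n=3)  # keeps the 3 most-attended tokens
relevant = {0, 5, 6}                      # hypothetical reference set
p, r, f = precision_recall_f(selected, relevant)
```

In this sketch, retaining only the top-$N$ tokens makes the selected set explicit, so separability against any reference set can be scored directly; the paper instead measures separability of the selected versus non-selected tokens geometrically.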
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11875