Impact of Low-rank Attention dynamics, Spectral Amplification and Grouped Query Attention (GQA) on the Reasoning and Stability of LLMs

07 May 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: Transformer Dynamics, Grouped Query Attention (GQA), Multi-Head Attention (MHA), Spectral Amplification, Representational Collapse, Chain-of-Thought (CoT)
TL;DR: Multi-Head Attention suffers from rapid mode collapse. This paper proves Grouped Query Attention acts as a structural regularizer that flattens the eigenvalue spectrum, significantly improving LLM multi-hop reasoning and stability.
Abstract: Transformer architectures propagate token representations through repeated applications of attention operators. While empirically powerful, these operators exhibit specific spectral properties that shape representation evolution, adversarial sensitivity, stability, and information propagation across layer depth. In this work, we propose a theoretical framework for analyzing transformer attention from a dynamical systems perspective. Under this framework, we present theorems on the eigen-properties of attention operators, which correspond to the modes governing information propagation and reasoning. Spectral amplification in attention corresponds to unstable modes that magnify perturbations exponentially with depth. We show that Grouped Query Attention (GQA) imposes a structural constraint that suppresses these unstable spectral modes. We further show that transformer layers govern representation dynamics through their spectral structure, and that GQA imposes a low-rank approximation that bounds adversarial amplification energy. These results provide a principled explanation for the empirical stability and efficiency of grouped attention mechanisms. We validate the results experimentally using the architectures and weights of recent state-of-the-art open-source small LLMs such as Phi-2 and Phi-4. We provide quantitative results based on spectral examination, comparing MHA-based and GQA-based architectures across several diagnostics (spectral entropy, effective rank, spectral decay rate, Gini coefficient, top-k energy fraction, and BTL reasoning scores) and drawing out their ramifications for reasoning and stability.
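The spectral diagnostics named in the abstract can be computed directly from an attention matrix's singular value spectrum. The sketch below is an illustrative implementation under common definitions (effective rank as the exponential of spectral entropy, Gini coefficient over sorted singular values, top-k fraction of squared-singular-value energy); it is not the paper's code, and the function and variable names are hypothetical.

```python
import numpy as np

def spectral_diagnostics(A, k=3):
    """Spectral metrics for an attention matrix A (illustrative sketch).

    Returns spectral entropy, effective rank, Gini coefficient, and
    top-k energy fraction of the singular value spectrum.
    """
    s = np.linalg.svd(A, compute_uv=False)      # singular values, descending
    p = s / s.sum()                             # normalized spectrum
    entropy = -np.sum(p * np.log(p + 1e-12))    # spectral entropy
    eff_rank = np.exp(entropy)                  # effective rank = exp(entropy)
    # Gini coefficient over ascending singular values: spectral concentration
    s_sorted = np.sort(s)
    n = len(s)
    gini = (2 * np.arange(1, n + 1) - n - 1) @ s_sorted / (n * s.sum())
    energy = s ** 2
    topk_energy = energy[:k].sum() / energy.sum()
    return entropy, eff_rank, gini, topk_energy

# Example: row-stochastic attention matrix from random softmax scores
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 8))
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
ent, er, g, tk = spectral_diagnostics(A)
```

A flatter spectrum yields higher spectral entropy and effective rank and a lower Gini coefficient and top-k energy fraction, which is the direction the abstract associates with GQA's regularizing effect.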
Submission Number: 73