Geometric Compression in Grokking: The Three-Stage Modular Dynamics of Transformers

ICLR 2026 Conference Submission 25540 Authors

20 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Grokking, Geometric Deep Learning, Transformers
TL;DR: We show that grokking in Transformers is not monotonic simplification, but a "construct-then-compress" algorithm where the Self-Attention module must first increase its geometric complexity to enable a subsequent, rapid compression in the FFN.
Abstract: A central puzzle in deep learning is how generalized algorithms emerge from training dynamics, particularly in the phenomenon of grokking. Existing approaches track function complexity (Linear Mapping Number) or representation dimensionality (Local Intrinsic Dimension). We take a different perspective: a unified algorithm should manifest as geometrically consistent transformations across inputs. We introduce the \textbf{Geometric Coherence Score (GCS)}, which measures the directional alignment of local Jacobian transformations across the data manifold. GCS provides a geometric signature of mechanistic unity: consistent transformations indicate a unified computational strategy, while scattered transformations suggest input-specific memorization. Combined with a fixed-final-geometry protocol that isolates mechanistic evolution from geometric drift, GCS reveals a \textbf{Construct-then-Compress} dynamic that is invisible to complexity or dimensionality metrics. In single-layer Transformers, this dynamic unfolds in three distinct phases: (1) \textit{Coherence Collapse}, where initial symmetry breaks to memorize the data; (2) \textit{Asynchronous Construction and Compression}, a critical silent phase where Attention initiates geometric reorganization, followed by the MLP after a temporal offset; and (3) \textit{Post-Grokking Refinement}, where the mechanism consolidates into a unified solution. We validate the construct-then-compress principle across activation functions (ReLU, GeLU, SiLU) and modular tasks (addition, subtraction, multiplication, division), establishing GCS as a principled diagnostic tool. Extending to multi-layer networks (2--3 layers), we observe that final layers exhibit iterative construct-compress cycles rather than a single three-phase trajectory, while early layers show path-specific stability. These findings reveal depth-dependent dynamics that warrant further investigation into how hierarchical structure shapes algorithmic formation.
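Illustratively, one natural instantiation of this alignment (a sketch consistent with the description above, not necessarily the paper's exact definition) is the mean pairwise cosine similarity of the vectorized local Jacobians $J_i$ of a module $f$ evaluated at inputs $x_i$:

\[
\mathrm{GCS}(f) \;=\; \frac{2}{n(n-1)} \sum_{i<j} \frac{\big\langle \operatorname{vec}(J_i),\; \operatorname{vec}(J_j) \big\rangle}{\lVert \operatorname{vec}(J_i) \rVert \,\lVert \operatorname{vec}(J_j) \rVert},
\qquad
J_i = \frac{\partial f(x)}{\partial x}\bigg|_{x = x_i},
\]

where values near $1$ indicate that the module applies geometrically consistent local transformations across the data manifold (a unified mechanism), and values near $0$ indicate scattered, input-specific transformations (memorization).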
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 25540