Track: tiny paper (up to 4 pages)
Keywords: Transformers; Geometry of representations; Persistent homology
Abstract: We propose a topological framework to analyze the layerwise evolution of transformer representations by modeling attention heads as Markov kernels on a token metric space. This formulation admits a Wasserstein-1 ($W_1$) lifting in which coarse Ollivier-Ricci curvature provides quantitative bounds on the action of the induced operator: positive curvature implies layerwise Wasserstein contraction, while negative curvature implies expansion. To connect these statements to practice, we introduce a reproducible probe that estimates robust lower quantiles of the curvature, directly tests contraction on random measures in $W_1$, and tracks layerwise topological simplification using persistent homology on diffusion-induced distances. In pretrained GPT-2 and GPT-2-medium models, we observe a depthwise transition toward more contractive behavior, with shrinking $H_1$ lifetimes and persistence of a coarse $H_0$ skeleton.
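To make the curvature quantity in the abstract concrete, the sketch below (not the authors' released probe) computes coarse Ollivier-Ricci curvature $\kappa(i,j) = 1 - W_1(\mu_i,\mu_j)/d(i,j)$ for one attention head viewed as a Markov kernel, where $\mu_i$ is row $i$ of the attention matrix and $d$ is a Euclidean metric on token representations. The array names `attn` and `emb`, the choice of base metric, and the use of the POT library are illustrative assumptions.

```python
# Minimal sketch, assuming a row-stochastic (T, T) attention matrix `attn`
# for a single head and (T, d) token representations `emb` that define the
# base metric. Requires the POT library (pip install pot) for exact W_1.
import numpy as np
import ot


def ollivier_ricci_curvatures(attn: np.ndarray, emb: np.ndarray) -> np.ndarray:
    """Return kappa(i, j) = 1 - W_1(mu_i, mu_j) / d(i, j) over token pairs i < j."""
    T = attn.shape[0]
    # Base metric: Euclidean distances between token representations.
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    kappas = []
    for i in range(T):
        for j in range(i + 1, T):
            if d[i, j] < 1e-8:
                continue  # skip (near-)duplicate tokens
            mu_i = attn[i] / attn[i].sum()  # row i as a probability measure
            mu_j = attn[j] / attn[j].sum()
            w1 = ot.emd2(mu_i, mu_j, d)     # exact Wasserstein-1 under metric d
            kappas.append(1.0 - w1 / d[i, j])
    return np.array(kappas)


# A robust lower quantile of kappa (e.g. the 10th percentile) is one way to
# summarize the worst-case layerwise contraction/expansion rate per head.
# kappa_q10 = np.quantile(ollivier_ricci_curvatures(attn, emb), 0.10)
```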
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and identifying URLs.
Submission Number: 86