Keywords: Transformers, Optimal Transport, Geometry, Representation Learning and Evolution
TL;DR: We prove transformer self-attention exactly solves semi-relaxed optimal transport with unit regularization, revealing geometric constraints that explain how representations evolve through network depth while maintaining stability.
Abstract: We prove that transformer self-attention matrices are exactly the optimal solutions to semi-relaxed entropic optimal transport problems with unit regularization (ε = 1). This mathematical equivalence, not an approximation or analogy, reveals that attention mechanisms inherently solve a specific optimal transport problem in which each query independently redistributes unit mass across keys. The semi-relaxed formulation is essential for causality in autoregressive models, since balanced OT would require adjusting past token representations. From this fundamental equivalence, we derive tight bounds showing that probability distributions induced by fixed probes evolve with total variation distance bounded by ∥W_out^⊤∥_{2→∞} ∥h^{(ℓ+1)} − h^{(ℓ)}∥_2. Through comprehensive empirical analysis of GPT-2 models (124M–1.5B parameters), we validate these theoretical predictions and discover an unexpected saturation phenomenon: when softmax confidence exceeds 0.9999, the induced distributions lock completely (TV < 10^{-10}) while the hidden states continue evolving by 2–9%, revealing a mechanism that separates decision certainty from continued computation. Our framework provides the first exact optimal transport characterization of attention, explaining fundamental constraints on transformer representation dynamics.
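A minimal numerical sketch of the central claim, not taken from the submission: assuming the per-query cost is C = −QKᵀ/√d and the semi-relaxed problem reduces per row to min_{p ∈ simplex} ⟨p, C_i⟩ + Σ_j p_j log p_j at ε = 1, the softmax attention row should coincide with the OT minimizer. The exponentiated-gradient solver below is only an independent check of that closed form; names and hyperparameters are illustrative.

```python
# Sketch (assumed setup, not the authors' code): verify that softmax attention
# rows match the minimizer of a semi-relaxed entropic OT problem with eps = 1.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

scores = Q @ K.T / np.sqrt(d)              # scaled dot-product scores
C = -scores                                # assumed OT cost: negative similarity
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)    # softmax attention rows

def solve_row(c, eps=1.0, steps=5000, lr=0.1):
    """Exponentiated-gradient descent on the simplex for
    min_p <p, c> + eps * sum_j p_j log p_j (semi-relaxed entropic OT, one row)."""
    p = np.full_like(c, 1.0 / len(c))
    for _ in range(steps):
        grad = c + eps * (np.log(p) + 1.0)
        p = p * np.exp(-lr * grad)
        p /= p.sum()
    return p

P = np.vstack([solve_row(C[i]) for i in range(n)])
print("max |attention - OT solution|:", np.abs(P - attn).max())  # expected near machine precision
```

The check works because the per-row objective has the closed-form minimizer p_j ∝ exp(−C_{ij}/ε), which at ε = 1 with C = −QKᵀ/√d is exactly the softmax row; the iterative solver is there only so the agreement is verified rather than assumed.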
Submission Number: 8