Gauge Fiber Bundle Geometry of Transformers

ICLR 2026 Conference Submission 19168 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Principal bundle geometry, Gauge symmetry of Transformers, Fisher–Rao natural gradient, Attention curvature & holonomy, RoPE commutant reduction
TL;DR: Principal-bundle geometry for GeLU Transformers: maximal gauge symmetry (free & proper), Fisher–Rao connection with natural-gradient as horizontal Riesz; attention has nonzero curvature. Empirical checks use Euclidean proxies.
Abstract: We give a geometry-first account of Transformers with GeLU activations. On a generic regular set of parameters, the head-wise symmetry group acts freely and properly, so the parameter space fibers over a quotient of functionally distinct models—a clean principal-bundle picture with gauge orbits as fibers and function-changing directions as horizontals. Using the empirical Fisher approximation to the Fisher–Rao metric yields a canonical horizontal distribution and clarifies that the natural gradient is the horizontal Riesz representative of the Euclidean gradient (reducing to orthogonal projection only in a special case). Within this framework, attention behaves like a connection with generically nonzero curvature (path-dependent transport), while the feed-forward block is largely fiber-preserving, with a dimension-controlled near-orthogonality to attention. We turn these ideas into practical diagnostics—a least-squares, gauge-aware gradient split and a small-loop holonomy estimator—then report Euclidean-proxy consistency checks aligning with the theory; full Fisher–Rao evaluations are presented as algorithms for future work. Architectural choices such as RoPE appear as principled gauge reductions (e.g., reducing the per-head Q/K gauge dimension from d_k² to d_k).
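The "least-squares, gauge-aware gradient split" mentioned above can be sketched in its Euclidean-proxy form: given a basis for the vertical subspace (tangent directions along the gauge orbit), the gradient's vertical component is its least-squares fit in that span, and the remainder is the function-changing horizontal part. This is a minimal illustrative sketch, not the authors' implementation; the function name `gauge_split` and the generator matrix `V` are hypothetical placeholders.

```python
import numpy as np

def gauge_split(g, V):
    """Euclidean-proxy split of gradient g into a vertical (gauge-orbit)
    and a horizontal (function-changing) component.

    g : (n,) gradient vector.
    V : (n, k) matrix whose columns span the vertical subspace,
        i.e. tangent directions generated by the gauge symmetry.
    """
    # Least squares: find c minimizing ||V c - g||^2; V c is the
    # vertical component, the orbit direction closest to g.
    c, *_ = np.linalg.lstsq(V, g, rcond=None)
    g_vert = V @ c
    g_horiz = g - g_vert
    return g_vert, g_horiz

# Toy check with a random vertical subspace: the horizontal part is
# Euclid-orthogonal to every gauge-generator direction.
rng = np.random.default_rng(0)
V = rng.standard_normal((10, 3))
g = rng.standard_normal(10)
g_vert, g_horiz = gauge_split(g, V)
assert np.allclose(V.T @ g_horiz, 0.0, atol=1e-10)
assert np.allclose(g_vert + g_horiz, g)
```

Under the Fisher–Rao metric the same split would instead use the metric-weighted normal equations, which is what the abstract's "horizontal Riesz representative" refers to; the Euclidean projection above is the special case of an identity metric.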
Primary Area: learning theory
Submission Number: 19168