Complete Characterization of Gauge Symmetries in Transformer Architectures

Published: 29 Oct 2025, Last Modified: 27 Nov 2025, NeurReps 2025 Proceedings, CC BY 4.0
Keywords: Canonical Transformers, Lie Algebra, Gauge Symmetry, Gauge Group, Geometry
TL;DR: Maximal Transformer gauge: Gₘₐₓ = ((GL(dₖ))ʰ × (GL(dᵥ))ʰ) ⋊ Sₕ. RoPE → QK commutant. Layer-wise factorization. Experiments confirm invariance to ≈10–25 ε (machine epsilon). Enables gauge-aware optimization, alignment, and lossless compression.
Abstract: Modern Transformers possess redundant parameter symmetries that leave their function unchanged. We establish the complete gauge group structure for the canonical Transformer family, which encompasses standard architectures including GPT-2, BERT, LLaMA, and Qwen. For canonical Transformers with standard multi-head attention, we prove global maximality: the gauge group equals exactly G_max = ((GL(d_k))^h × (GL(d_v))^h) ⋊ S_h on the generic stratum where projection matrices have full column rank and head-wise attention controllability holds. For architectures with rotary position embeddings (RoPE) or relative encodings, as used in LLaMA and Qwen, the gauge group becomes G_RoPE = ((C_RoPE)^h × (GL(d_v))^h) ⋊ S_h, where C_RoPE is the commutant of the position-dependent rotations, typically reducing to (GL(1,ℂ))^{d_k/2} for standard RoPE implementations. We prove maximality through three key results: characterizing the Lie algebra of infinitesimal symmetries as 𝔤_max = ⨁_{i=1}^h 𝔤𝔩(d_k) ⊕ ⨁_{i=1}^h 𝔤𝔩(d_v) for canonical models, establishing that attention weights must be preserved up to head permutation under gauge equivalence, and demonstrating that query–key and value–output transformations necessarily factorize independently. These gauge symmetries persist through LayerNorm and extend to complete architectures, with the full model gauge group being G_Model = ∏_{l=1}^L G_Layer^{(l)}. Our characterization reveals over 1.1 million redundant dimensions in a 110M-parameter Transformer Base model. Experiments on pretrained GPT-2 models from 124M to 1.5B parameters confirm that valid gauge transformations preserve model outputs to machine precision, while invalid transformations produce large errors, empirically supporting maximality.
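The abstract's core claim can be checked numerically: for a single attention head, transforming the query–key pair by any invertible A ∈ GL(d_k) (as W_Q → W_Q A, W_K → W_K A^{-T}) and the value–output pair by any invertible B ∈ GL(d_v) (as W_V → W_V B, W_O → B^{-1} W_O) leaves the head's output unchanged, since A and B cancel inside the logits and the value path, respectively. The following is a minimal NumPy sketch of this invariance under standard scaled dot-product attention; all variable names and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dk, dv = 5, 8, 4, 4  # tokens, model dim, key dim, value dim

X = rng.normal(size=(n, d))                               # token representations
Wq, Wk = rng.normal(size=(d, dk)), rng.normal(size=(d, dk))
Wv, Wo = rng.normal(size=(d, dv)), rng.normal(size=(dv, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(Wq, Wk, Wv, Wo):
    # Standard scaled dot-product attention for one head.
    logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(dk)
    return softmax(logits) @ (X @ Wv) @ Wo

# Random square Gaussian matrices are invertible almost surely,
# so A and B are generic elements of GL(d_k) and GL(d_v).
A = rng.normal(size=(dk, dk))
B = rng.normal(size=(dv, dv))

y0 = head(Wq, Wk, Wv, Wo)
# QK gauge: Wq -> Wq A, Wk -> Wk A^{-T}  (logits unchanged: A A^{-1} cancels)
# VO gauge: Wv -> Wv B, Wo -> B^{-1} Wo  (value path unchanged: B B^{-1} cancels)
y1 = head(Wq @ A, Wk @ np.linalg.inv(A).T, Wv @ B, np.linalg.inv(B) @ Wo)

assert np.allclose(y0, y1)  # outputs agree to numerical precision
```

An invalid transformation, e.g. applying A to W_Q without the compensating A^{-T} on W_K, breaks the cancellation and produces large output differences, which is the distinction the paper's experiments probe.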
Submission Number: 12