Keywords: Canonical Transformers, Lie Algebra, Gauge Symmetry, Gauge Group, Geometry
TL;DR: Maximal Transformer gauge: Gₘₐₓ = ((GL(dₖ))ʰ × (GL(dᵥ))ʰ) ⋊ Sₕ. RoPE → QK commutant. Layer-wise factorization. Experiments show invariance (≈10–25 ε). Enables gauge-aware optimization, alignment, and lossless compression.
Abstract: Modern Transformers possess redundant parameter symmetries that leave their
function unchanged. We establish the complete gauge group structure for the
canonical Transformer family, which encompasses standard architectures
including GPT-2, BERT, LLaMA, and Qwen.
For canonical Transformers with standard multi-head attention, we prove global
maximality: the gauge group is exactly
G_max = ((GL(d_k))^h × (GL(d_v))^h) ⋊ S_h
on the generic stratum where projection matrices have full column rank and
head-wise attention controllability holds.
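As a concrete illustration of this per-head gauge action, the sketch below is a minimal NumPy example (the weight names Wq, Wk, Wv, Wo, the dimensions, and the random data are illustrative and not taken from the paper): it applies an invertible A ∈ GL(d_k) to the query and key projections and B ∈ GL(d_v) to the value and output projections of a single head, and checks that the head output is unchanged up to floating-point round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v, T = 16, 4, 4, 8   # illustrative sizes, not from the paper

# Random single-head attention weights and an input sequence X of T tokens.
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_v))
Wo = rng.normal(size=(d_v, d_model))
X  = rng.normal(size=(T, d_model))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head(Wq, Wk, Wv, Wo):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_k)
    return softmax(scores) @ (X @ Wv) @ Wo

# Gauge transform: A acts on the query-key pair, B on the value-output pair.
A = rng.normal(size=(d_k, d_k)) + 3 * np.eye(d_k)   # invertible with high probability
B = rng.normal(size=(d_v, d_v)) + 3 * np.eye(d_v)

out_ref   = head(Wq, Wk, Wv, Wo)
out_gauge = head(Wq @ A, Wk @ np.linalg.inv(A).T, Wv @ B, np.linalg.inv(B) @ Wo)
print(np.max(np.abs(out_ref - out_gauge)))   # round-off level (~1e-12): head output unchanged
```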
For architectures with rotary position embeddings (RoPE) or relative encodings,
as used in LLaMA and Qwen, the gauge group becomes
G_RoPE = ((C_RoPE)^h × (GL(d_v))^h) ⋊ S_h
where C_RoPE is the commutant of the position-dependent rotations—typically
reducing to (GL(1,ℂ))^{d_k/2} for standard RoPE implementations.
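To make the RoPE constraint concrete, the next sketch (again an illustrative NumPy example, assuming the standard RoPE frequency schedule θ_i = 10000^(-2i/d_k)) builds the block-diagonal position rotations and checks that a per-plane "complex scalar" block [[a, -b], [b, a]] commutes with them at every position, while a generic d_k × d_k matrix does not; such per-plane scalars correspond to the (GL(1,ℂ))^{d_k/2} factor mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k = 8                                                    # illustrative head dimension, d_k/2 planes
thetas = 10000.0 ** (-np.arange(d_k // 2) * 2.0 / d_k)     # standard RoPE frequencies (assumed)

def rope(pos):
    """Block-diagonal rotation applied by RoPE at token position `pos`."""
    R = np.zeros((d_k, d_k))
    for i, theta in enumerate(thetas):
        a = pos * theta
        R[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(a), -np.sin(a)],
                                   [np.sin(a),  np.cos(a)]]
    return R

def complex_scalar(ab):
    """Candidate commutant element: one 'complex scalar' a + b i per 2D plane."""
    C = np.zeros((d_k, d_k))
    for i, (a, b) in enumerate(ab):
        C[2*i:2*i+2, 2*i:2*i+2] = [[a, -b],
                                   [b,  a]]
    return C

C = complex_scalar(rng.normal(size=(d_k // 2, 2)))   # lies in (GL(1,C))^{d_k/2}
M = rng.normal(size=(d_k, d_k))                      # generic GL(d_k) element
for pos in [1, 7, 42]:
    R = rope(pos)
    print(np.allclose(C @ R, R @ C), np.allclose(M @ R, R @ M))   # prints: True False
```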
We prove maximality through three key results: characterizing the Lie algebra
of infinitesimal symmetries as
𝔤_max = ⨁_{i=1}^h 𝔤𝔩(d_k) ⊕ ⨁_{i=1}^h 𝔤𝔩(d_v)
for canonical models, establishing that attention weights must be preserved up
to head permutation under gauge equivalence, and demonstrating that query–key
and value–output transformations necessarily factorize independently.
These gauge symmetries persist through LayerNorm and extend to complete
architectures, with the full model gauge group being
G_Model = ∏_{l=1}^L G_Layer^{(l)}.
Our characterization reveals over 1.1 million redundant dimensions in a
110M-parameter Transformer Base model. Experiments on pretrained GPT-2 models from
124M to 1.5B parameters confirm that valid gauge transformations preserve model
outputs to machine precision, while invalid transformations produce large
errors, empirically supporting maximality.
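The redundant-dimension count can be reproduced with a short tally. Assuming the standard Transformer Base configuration (L = 12 layers, h = 12 heads, d_k = d_v = 64), the continuous part of the attention gauge group contributes h·(d_k² + d_v²) dimensions per layer, while the head permutations S_h are discrete and add none:

```python
L, h, d_k, d_v = 12, 12, 64, 64          # Transformer Base configuration (assumed)
per_layer = h * (d_k**2 + d_v**2)        # dim of (GL(d_k))^h x (GL(d_v))^h; S_h is discrete
print(per_layer, L * per_layer)          # 98304 per layer, 1179648 total (over 1.1 million)
```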
Submission Number: 12