Keywords: Transformer gauge symmetry, Gauge groups (maximality), Multi-Head Attention (MHA), Mixture-of-Experts (MoE) invariance
TL;DR: We prove the maximal gauge groups of Transformer attention, extend them to RoPE and GQA/MQA, and show MoE routing invariance. The layerwise structure enables lossless KV-basis rewrites and h/g multiplicative savings.
Abstract: Modern Transformers possess parameter symmetries, i.e., redundant reparameterizations that leave the network function unchanged. We establish the complete gauge-group structure for the canonical Transformer family, which encompasses standard architectures including GPT-2, BERT, LLaMA, and Qwen. For canonical Transformers with standard multi-head attention, we prove global maximality: the gauge group is exactly
G_max = ((GL(dₖ))^h × (GL(dᵥ))^h) ⋊ Sₕ
on the generic stratum where projection matrices have full column rank and head-wise attention controllability holds. For architectures with rotary position embeddings (RoPE) or relative encodings, as used in LLaMA and Qwen, the gauge group becomes
G_RoPE = ((𝒞_RoPE)^h × (GL(dᵥ))^h) ⋊ Sₕ,
where 𝒞_RoPE is the commutant of the position-dependent rotations, which typically reduces to (GL(1, ℂ))^(dₖ/2) for standard RoPE implementations. We prove maximality through three key results: characterizing the Lie algebra of infinitesimal symmetries as
𝔤_max = (⊕_{i=1}^h 𝔤𝔩(dₖ)) ⊕ (⊕_{i=1}^h 𝔤𝔩(dᵥ))
for canonical models; establishing that attention weights must be preserved (up to head permutation) under gauge equivalence; and demonstrating that query–key and value–output transformations necessarily factorize independently. These gauge symmetries persist through LayerNorm and extend to complete architectures, with the full-model gauge group
G_Model = ∏_{l=1}^{L} G_Layer^(l).
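To make the layer-level statement concrete, the following minimal NumPy sketch checks numerically that one element of ((GL(dₖ))^h × (GL(dᵥ))^h) leaves a multi-head attention layer unchanged: each Aᵢ acts on the query–key pair of head i and each Bᵢ on its value–output pair. This is an illustrative construction, not the paper's code; it assumes a simplified unmasked layer with explicit per-head projections (and a per-head split of the output projection), and omits the head-permutation factor Sₕ and LayerNorm.

```python
# Minimal NumPy sketch (illustrative, not the paper's code): verify that one
# element of ((GL(d_k))^h x (GL(d_v))^h) leaves multi-head attention unchanged.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, h, d_k, d_v = 6, 32, 4, 8, 8     # tokens, model width, heads, head dims

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mha(X, W_q, W_k, W_v, W_o):
    """Unmasked multi-head attention with explicit per-head projections.
    Summing the per-head blocks of the output projection is equivalent to the
    usual concatenate-then-project formulation."""
    out = np.zeros((T, d_model))
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]
        attn = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights: gauge-invariant
        out += attn @ V @ W_o[i]                 # value/output path of head i
    return out

X   = rng.standard_normal((T, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((h, d_model, d_k))
W_v = rng.standard_normal((h, d_model, d_v))
W_o = rng.standard_normal((h, d_v, d_model))

# One gauge element: A_i transforms the query/key pair of head i, B_i the
# value/output pair.  Adding 3*I keeps the random matrices safely invertible.
A = rng.standard_normal((h, d_k, d_k)) + 3 * np.eye(d_k)
B = rng.standard_normal((h, d_v, d_v)) + 3 * np.eye(d_v)

W_q2 = np.einsum('hmk,hkl->hml', W_q, A)                                    # W_Q^i A_i
W_k2 = np.einsum('hmk,hkl->hml', W_k, np.linalg.inv(A).transpose(0, 2, 1))  # W_K^i A_i^{-T}
W_v2 = np.einsum('hmv,hvw->hmw', W_v, B)                                    # W_V^i B_i
W_o2 = np.einsum('hvw,hwm->hvm', np.linalg.inv(B), W_o)                     # B_i^{-1} W_O^i

y1, y2 = mha(X, W_q, W_k, W_v, W_o), mha(X, W_q2, W_k2, W_v2, W_o2)
print(np.max(np.abs(y1 - y2)) / np.max(np.abs(y1)))   # on the order of machine epsilon
```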
Our characterization reveals over 1.1 million redundant dimensions in a 110M-parameter Transformer-Base model. Experiments confirm that gauge transformations preserve outputs to within 24 ε_mach relative error across diverse architectures, while transformations outside G_max produce O(1) changes, empirically supporting maximality.
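For reference, the 1.1M figure follows from the per-layer count h·dₖ² + h·dᵥ² of continuous gauge dimensions (the Sₕ factor is discrete and contributes none), summed over layers. A one-line check, assuming the usual Transformer-Base shape of 12 layers, 12 heads, and dₖ = dᵥ = 64 (our assumption about the exact configuration):

```python
# Back-of-the-envelope check of the "over 1.1 million" figure, assuming the
# standard Transformer-Base hyperparameters (L = 12 layers, h = 12 heads,
# d_k = d_v = 64); each GL(d) factor contributes d^2 continuous dimensions.
L, h, d_k, d_v = 12, 12, 64, 64
dim_per_layer = h * d_k**2 + h * d_v**2      # dim of (GL(d_k))^h x (GL(d_v))^h per layer
print(L * dim_per_layer)                     # 1179648, i.e. about 1.18M
```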
For grouped/multi-query attention (GQA/MQA), the admissible query–key and value transforms are tied per K/V group, yielding a reduced symmetry
G_share = ((GL(dₖ))^g × (GL(dᵥ))^g) ⋊ (Sₕ × S_g)
(with GL(dₖ) again replaced by 𝒞_RoPE under RoPE), and standard top-k MoE routing is invariant under all of these gauge transformations.
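As a companion sketch, the construction below instantiates the tied, per-group gauge action described above for grouped-query attention: a single Aⱼ per K/V group acts on the shared key head and on every query head in that group, and a single Bⱼ acts on the shared value head and, inversely, on each member head's output block. The shapes, grouping rule, and names are illustrative assumptions rather than the paper's code, and the permutation factors and the RoPE restriction to 𝒞_RoPE are omitted.

```python
# Minimal NumPy sketch (illustrative, not the paper's code): the tied gauge
# action for grouped-query attention, with one (A_j, B_j) per K/V group.
import numpy as np

rng = np.random.default_rng(1)
T, d_model, h, g, d_k, d_v = 6, 32, 4, 2, 8, 8   # h query heads share g K/V heads

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gqa(X, W_q, W_k, W_v, W_o):
    """Grouped-query attention: query head i uses K/V group i // (h // g)."""
    out = np.zeros((T, d_model))
    for i in range(h):
        j = i // (h // g)
        Q, K, V = X @ W_q[i], X @ W_k[j], X @ W_v[j]
        out += softmax(Q @ K.T / np.sqrt(d_k)) @ V @ W_o[i]
    return out

X   = rng.standard_normal((T, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((g, d_model, d_k))     # one shared key head per group
W_v = rng.standard_normal((g, d_model, d_v))     # one shared value head per group
W_o = rng.standard_normal((h, d_v, d_model))

A = rng.standard_normal((g, d_k, d_k)) + 3 * np.eye(d_k)   # tied A_j per group
B = rng.standard_normal((g, d_v, d_v)) + 3 * np.eye(d_v)   # tied B_j per group

group = [i // (h // g) for i in range(h)]        # group index of each query head
W_q2 = np.stack([W_q[i] @ A[group[i]] for i in range(h)])                 # W_Q^i A_{g(i)}
W_k2 = np.stack([W_k[j] @ np.linalg.inv(A[j]).T for j in range(g)])       # W_K^j A_j^{-T}
W_v2 = np.stack([W_v[j] @ B[j] for j in range(g)])                        # W_V^j B_j
W_o2 = np.stack([np.linalg.inv(B[group[i]]) @ W_o[i] for i in range(h)])  # B_{g(i)}^{-1} W_O^i

y1, y2 = gqa(X, W_q, W_k, W_v, W_o), gqa(X, W_q2, W_k2, W_v2, W_o2)
print(np.max(np.abs(y1 - y2)) / np.max(np.abs(y1)))   # on the order of machine epsilon
```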
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3587