Keywords: Transformer gauge symmetry, Gauge groups (maximality), Multi-Head Attention (MHA), Mixture-of-Experts (MoE) invariance
TL;DR: We prove the maximal gauge groups of Transformer attention, extend them to RoPE and GQA/MQA, and show MoE routing invariance. The layerwise structure enables lossless KV-basis rewrites and h/g multiplicative savings.
Abstract: Modern Transformers possess parameter symmetries, i.e., redundant reparameterizations that leave the network function unchanged. We establish the complete gauge-group structure for the canonical Transformer family, which encompasses standard architectures including GPT-2, BERT, LLaMA, and Qwen. For canonical Transformers with standard multi-head attention, we prove global maximality: the gauge group is exactly
G_max = ((GL(dₖ))^h × (GL(dᵥ))^h) ⋊ Sₕ
on the generic stratum where projection matrices have full column rank and head-wise attention controllability holds. For architectures with rotary position embeddings (RoPE) or relative encodings, as used in LLaMA and Qwen, the gauge group becomes
G_RoPE = ((𝒞_RoPE)^h × (GL(dᵥ))^h) ⋊ Sₕ,
where 𝒞_RoPE is the commutant of the position-dependent rotations, which typically reduces to (GL(1, ℂ))^(dₖ/2) for standard RoPE implementations. We prove maximality through three key results: characterizing the Lie algebra of infinitesimal symmetries as
𝔤_max = (⊕_{i=1}^h 𝔤𝔩(dₖ)) ⊕ (⊕_{i=1}^h 𝔤𝔩(dᵥ))
for canonical models; establishing that attention weights must be preserved (up to head permutation) under gauge equivalence; and demonstrating that query–key and value–output transformations necessarily factorize independently. These gauge symmetries persist through LayerNorm and extend to complete architectures, with the full-model gauge group
G_Model = ∏_{l=1}^{L} G_Layer^(l).
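To make the layer-level statement concrete, the following minimal NumPy sketch checks numerically that one element of ((GL(dₖ))^h × (GL(dᵥ))^h) leaves a multi-head attention layer unchanged: each Aᵢ acts on the query–key pair of head i and each Bᵢ on its value–output pair. This is an illustrative construction, not the paper's code; it assumes a simplified unmasked layer with explicit per-head projections (and a per-head split of the output projection), and omits the head-permutation factor Sₕ and LayerNorm.

```python
# Minimal NumPy sketch (illustrative, not the paper's code): verify that one
# element of ((GL(d_k))^h x (GL(d_v))^h) leaves multi-head attention unchanged.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, h, d_k, d_v = 6, 32, 4, 8, 8     # tokens, model width, heads, head dims

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mha(X, W_q, W_k, W_v, W_o):
    """Unmasked multi-head attention with explicit per-head projections.
    Summing the per-head blocks of the output projection is equivalent to the
    usual concatenate-then-project formulation."""
    out = np.zeros((T, d_model))
    for i in range(h):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]
        attn = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights: gauge-invariant
        out += attn @ V @ W_o[i]                 # value/output path of head i
    return out

X   = rng.standard_normal((T, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((h, d_model, d_k))
W_v = rng.standard_normal((h, d_model, d_v))
W_o = rng.standard_normal((h, d_v, d_model))

# One gauge element: A_i transforms the query/key pair of head i, B_i the
# value/output pair.  Adding 3*I keeps the random matrices safely invertible.
A = rng.standard_normal((h, d_k, d_k)) + 3 * np.eye(d_k)
B = rng.standard_normal((h, d_v, d_v)) + 3 * np.eye(d_v)

W_q2 = np.einsum('hmk,hkl->hml', W_q, A)                                    # W_Q^i A_i
W_k2 = np.einsum('hmk,hkl->hml', W_k, np.linalg.inv(A).transpose(0, 2, 1))  # W_K^i A_i^{-T}
W_v2 = np.einsum('hmv,hvw->hmw', W_v, B)                                    # W_V^i B_i
W_o2 = np.einsum('hvw,hwm->hvm', np.linalg.inv(B), W_o)                     # B_i^{-1} W_O^i

y1, y2 = mha(X, W_q, W_k, W_v, W_o), mha(X, W_q2, W_k2, W_v2, W_o2)
print(np.max(np.abs(y1 - y2)) / np.max(np.abs(y1)))   # on the order of machine epsilon
```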
Our characterization reveals over 1.1 million redundant dimensions in a 110M-parameter Transformer-Base model. Experiments confirm that gauge transformations preserve outputs to within 24 ε_mach relative error across diverse architectures, while transformations outside G_max produce O(1) changes, empirically supporting maximality.
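For reference, the 1.1M figure follows from the per-layer count h·dₖ² + h·dᵥ² of continuous gauge dimensions (the Sₕ factor is discrete and contributes none), summed over layers. A one-line check, assuming the usual Transformer-Base shape of 12 layers, 12 heads, and dₖ = dᵥ = 64 (our assumption about the exact configuration):

```python
# Back-of-the-envelope check of the "over 1.1 million" figure, assuming the
# standard Transformer-Base hyperparameters (L = 12 layers, h = 12 heads,
# d_k = d_v = 64); each GL(d) factor contributes d^2 continuous dimensions.
L, h, d_k, d_v = 12, 12, 64, 64
dim_per_layer = h * d_k**2 + h * d_v**2      # dim of (GL(d_k))^h x (GL(d_v))^h per layer
print(L * dim_per_layer)                     # 1179648, i.e. about 1.18M
```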
For grouped/multi-query attention (GQA/MQA), the admissible query–key and value transforms are tied per K/V group, yielding a reduced symmetry
G_share = ((GL(dₖ))^g × (GL(dᵥ))^g) ⋊ (Sₕ × S_g)
(with GL(dₖ) again replaced by 𝒞_RoPE under RoPE), and standard top-k MoE routing is invariant under all of these gauge transformations.
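As a companion sketch, the construction below instantiates the tied, per-group gauge action described above for grouped-query attention: a single Aⱼ per K/V group acts on the shared key head and on every query head in that group, and a single Bⱼ acts on the shared value head and, inversely, on each member head's output block. The shapes, grouping rule, and names are illustrative assumptions rather than the paper's code, and the permutation factors and the RoPE restriction to 𝒞_RoPE are omitted.

```python
# Minimal NumPy sketch (illustrative, not the paper's code): the tied gauge
# action for grouped-query attention, with one (A_j, B_j) per K/V group.
import numpy as np

rng = np.random.default_rng(1)
T, d_model, h, g, d_k, d_v = 6, 32, 4, 2, 8, 8   # h query heads share g K/V heads

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gqa(X, W_q, W_k, W_v, W_o):
    """Grouped-query attention: query head i uses K/V group i // (h // g)."""
    out = np.zeros((T, d_model))
    for i in range(h):
        j = i // (h // g)
        Q, K, V = X @ W_q[i], X @ W_k[j], X @ W_v[j]
        out += softmax(Q @ K.T / np.sqrt(d_k)) @ V @ W_o[i]
    return out

X   = rng.standard_normal((T, d_model))
W_q = rng.standard_normal((h, d_model, d_k))
W_k = rng.standard_normal((g, d_model, d_k))     # one shared key head per group
W_v = rng.standard_normal((g, d_model, d_v))     # one shared value head per group
W_o = rng.standard_normal((h, d_v, d_model))

A = rng.standard_normal((g, d_k, d_k)) + 3 * np.eye(d_k)   # tied A_j per group
B = rng.standard_normal((g, d_v, d_v)) + 3 * np.eye(d_v)   # tied B_j per group

group = [i // (h // g) for i in range(h)]        # group index of each query head
W_q2 = np.stack([W_q[i] @ A[group[i]] for i in range(h)])                 # W_Q^i A_{g(i)}
W_k2 = np.stack([W_k[j] @ np.linalg.inv(A[j]).T for j in range(g)])       # W_K^j A_j^{-T}
W_v2 = np.stack([W_v[j] @ B[j] for j in range(g)])                        # W_V^j B_j
W_o2 = np.stack([np.linalg.inv(B[group[i]]) @ W_o[i] for i in range(h)])  # B_{g(i)}^{-1} W_O^i

y1, y2 = gqa(X, W_q, W_k, W_v, W_o), gqa(X, W_q2, W_k2, W_v2, W_o2)
print(np.max(np.abs(y1 - y2)) / np.max(np.abs(y1)))   # on the order of machine epsilon
```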
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3587