Keywords: grokking, group theory, finite groups, CP tensor decomposition, Wedderburn decomposition, representation theory, mechanistic interpretability, equivariance, geometric deep learning, bilinear models, algebra discovery, inductive bias, scaling laws, Fourier analysis, modular arithmetic, non-abelian groups, Hamiltonian neural networks, symplectic integrators, conservation laws, molecular property prediction, QM9, phase transitions, double descent, optimization dynamics, implicit bias
TL;DR: FORGE embeds a learned group algebra into the network architecture, achieving 10× faster grokking than MLP+optimizer tricks, power-law grokking-time scaling across every finite group family tested, and a 15.3% lower QM9 U₀ MAE in molecular property prediction.
Abstract: Neural networks can implicitly discover algebraic structure through the grokking phenomenon, but prior mechanistic accounts are limited to cyclic groups and sparse Fourier representations; whether a learned algebraic prior can generalize to arbitrary finite groups and dramatically accelerate convergence remains open. We introduce FORGE, a rank-R bilinear product μ(a,b)=Σᵣ Wᵣ((Uᵣa)⊙(Vᵣb)) trained jointly with associativity, identity, and inverse algebra losses, which we prove is a CP factorization of the group multiplication tensor T_G. Six propositions characterize the mechanism: a Strassen-refined rank bound (rank_CP(T_{S₄})≤55 < 64 = Σ_ρ dim(ρ)³, improving the Wedderburn upper bound via Strassen's and Laderman's matrix-multiplication algorithms); mechanistic Wedderburn recovery; CP–isotypic alignment; Frobenius–Schur discrimination; a causal axiom-emergence theorem; and a universal sub-linear grokking-time scaling law. A four-way matched-budget ablation isolates the architecture as essential. On ℤ/97, MLP+Grokfast achieves 0.86× (a slowdown), while FORGE+Grokfast achieves 10.20×. Matched MLPs fail on S₃, D₄, and A₄ in all 9 runs (val=0.000), while FORGE groks all three in ∼10³ steps via a qualitatively distinct non-Fourier route, with ≈23 vs. ≈16 effective harmonic modes (6 seeds, sign test p<0.016). FORGE groks every finite group tested (abelian, dihedral, alternating, symmetric, and quaternionic) through A₇ (order 2,520) and ℤ/1009 (order 1,009); on A₅, the smallest non-solvable group, FORGE achieves a 10.1× speedup over MLP (1,800 vs. 18,267 steps) and 3.0× over MLP+Grokfast (5,467 steps). Grokking time follows a universal power law 732·|G|^{0.170} (R²=0.785, 20 groups, 68 seeds): a 168× increase in group order raises grokking time by only a factor of about 2.4. Mechanistic analysis recovers ≈1,630 of 1,639 Wedderburn conjugacy-class multiplicities exactly; identity axiom emergence causally precedes generalization by 583±80 steps in all 12 seed-runs, refuting the tautology hypothesis.
Beyond group theory, FORGE reduces Hamiltonian invariant drift by 20–150,000× and improves QM9 U₀ MAE by 15.3%, establishing differentiable algebra discovery as a general-purpose inductive bias.
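To make the bilinear form in the abstract concrete, the following minimal sketch builds an exact CP factorization of the multiplication tensor of a cyclic group ℤ/n from the discrete Fourier transform, and checks the arithmetic behind the S₄ rank bound and the scaling claim. It is an illustration under stated assumptions, not the FORGE implementation: the factors are written down in closed form rather than learned, each Uᵣ and Vᵣ is taken to be a single row (so Uᵣa is a scalar), and the names `cyclic_cp_factors` and `mu` are hypothetical.

```python
import numpy as np

def cyclic_cp_factors(n):
    """Closed-form rank-n CP factors of the Z/n multiplication tensor (DFT-based)."""
    k = np.arange(n)
    F = np.exp(-2j * np.pi * np.outer(k, k) / n)  # F[r, a] = omega^(-r*a)
    U = F                   # row r plays the role of U_r (assumption: one row per r)
    V = F                   # row r plays the role of V_r
    W = np.conj(F).T / n    # column r plays the role of W_r
    return U, V, W

def mu(U, V, W, a, b):
    """mu(a, b) = sum_r W_r * ((U_r a) * (V_r b)), with scalar projections."""
    return W @ ((U @ a) * (V @ b))

n = 7
U, V, W = cyclic_cp_factors(n)

# Multiplication tensor T[a, b, c] = 1 iff c = (a + b) mod n.
T = np.zeros((n, n, n))
for a in range(n):
    for b in range(n):
        T[a, b, (a + b) % n] = 1.0

# The rank-n CP reconstruction matches T up to floating-point error.
T_cp = np.einsum("ra,rb,cr->abc", U, V, W)
cp_is_exact = np.allclose(T_cp, T)

# On one-hot inputs, mu reproduces the group law.
eye = np.eye(n)
product_3_5 = int(np.argmax(mu(U, V, W, eye[3], eye[5]).real))  # (3 + 5) mod 7

# Rank-bound arithmetic for S4, whose irreducible dimensions are 1, 1, 2, 3, 3.
s4_dims = [1, 1, 2, 3, 3]
naive_bound = sum(d**3 for d in s4_dims)            # schoolbook d x d matmul
matmul_rank = {1: 1, 2: 7, 3: 23}                   # Strassen (2x2), Laderman (3x3)
refined_bound = sum(matmul_rank[d] for d in s4_dims)

# Power-law check: a 168x larger group scales steps by 168**0.170.
step_ratio = 168 ** 0.170
```

For a cyclic group every irreducible representation is one-dimensional, so the Fourier factors above give a CP rank equal to the group order; the non-abelian case replaces each scalar mode with a matrix block, which is where the matrix-multiplication rank bounds enter.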
Submission Number: 61