Divine Benevolence is an $x^2$: GLUs have asymptotically faster scaling laws than MLPs

Published: 02 Mar 2026, Last Modified: 16 Mar 2026 | Sci4DL 2026 | Everyone | CC BY 4.0
Keywords: scaling laws, gated linear units, numerical analysis
TL;DR: A numerical analysis lens reveals that the hidden $x^2$ term in GLUs yields faster asymptotic scaling than MLPs ($L\propto P^{-3}$ vs. $L\propto P^{-2}$) on 1D reconstruction.
Abstract: Scaling laws can be understood from ground-up numerical analysis, where traditional function approximation theory can explain shifts in model architecture choices. GLU variants now dominate frontier LLMs, and similar outer-product architectures are prevalent in ranking models. The success of these architectures has mostly been left as an empirical discovery. We apply the tools of numerical analysis to expose a key factor: these models contain an $x^2$ term, which enables \emph{asymptotically} faster scaling than MLPs. GLUs have piecewise quadratic functional forms, which are sufficient to exhibit a quadratic order of approximation. The $L(P)$ scaling law is $L(P)\propto P^{-3}$ for GLUs but only $L(P)\propto P^{-2}$ for MLPs. We provide a parameter construction and empirical verification of these slopes for low-dimensional function approximation on synthetic and real data. Building on these first principles, we go one step further and propose the ``Gated Quadratic Unit'', which has an even steeper $L(P)$ slope than both the GLU and the MLP. This opens the possibility of architecture design from first-principles numerical theory to unlock superior scaling in large models. Replication code is available at \url{https://github.com/afqueiruga/divine_scaling}.
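The classical approximation-theory fact behind the $P^{-2}$ versus $P^{-3}$ claim can be illustrated with a small sketch. This is not the paper's parameter construction or its training setup. It assumes that a ReLU MLP in 1D behaves like a piecewise-linear interpolant, that a GLU behaves like a piecewise-quadratic interpolant, that the parameter count $P$ is proportional to the number of pieces `n`, and that the error is measured as the maximum absolute error. The target function `math.sin` on $[0, 3]$ and all function names are hypothetical choices made for illustration.

```python
import math

def piecewise_linear(f, a, b, n, x):
    """Piecewise-linear interpolant of f on n equal pieces of [a, b], evaluated at x."""
    h = (b - a) / n
    i = min(int((x - a) / h), n - 1)   # index of the piece containing x
    x0 = a + i * h
    t = (x - x0) / h                   # local coordinate in [0, 1]
    return (1 - t) * f(x0) + t * f(x0 + h)

def piecewise_quadratic(f, a, b, n, x):
    """Piecewise-quadratic interpolant through both endpoints and the midpoint of each piece."""
    h = (b - a) / n
    i = min(int((x - a) / h), n - 1)
    x0 = a + i * h
    t = (x - x0) / h
    f0, fm, f1 = f(x0), f(x0 + h / 2), f(x0 + h)
    # Lagrange basis polynomials on the local nodes t = 0, 1/2, 1.
    return (2 * (t - 0.5) * (t - 1) * f0
            - 4 * t * (t - 1) * fm
            + 2 * t * (t - 0.5) * f1)

def max_error(interp, f, a, b, n, samples=20001):
    """Maximum absolute error of the interpolant on a dense uniform sample grid."""
    worst = 0.0
    for k in range(samples):
        x = a + (b - a) * k / (samples - 1)
        worst = max(worst, abs(interp(f, a, b, n, x) - f(x)))
    return worst

def empirical_slope(interp, f, a, b, n):
    """Log-log slope of the error when the number of pieces doubles from n to 2n."""
    e1 = max_error(interp, f, a, b, n)
    e2 = max_error(interp, f, a, b, 2 * n)
    return math.log(e2 / e1) / math.log(2)

slope_linear = empirical_slope(piecewise_linear, math.sin, 0.0, 3.0, 16)
slope_quadratic = empirical_slope(piecewise_quadratic, math.sin, 0.0, 3.0, 16)
print(f"piecewise linear slope:    {slope_linear:.2f}")
print(f"piecewise quadratic slope: {slope_quadratic:.2f}")
```

Under these assumptions the two estimated slopes are expected to land near $-2$ and $-3$ respectively. A quadratic piece uses three values where a linear piece uses two, but that only changes the constant factor, not the slope in $n$.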
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 82