Favorability of Loss Landscape with Regularization Requires Both Large Overparametrization and Initialization

13 Feb 2026 (modified: 21 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes favorable -- i.e., spurious local minima represent a negligible fraction of local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely, in this regime almost all constant activation regions contain a global minimum and no spurious local minima. We further show, via the example of orthogonal data, that this level of overparametrization is not only sufficient but also necessary. Finally, we demonstrate that such loss landscape results are primarily relevant in the large initialization regime. In contrast, for small initializations -- corresponding to the feature learning regime -- optimization can still converge to spurious local minima, despite the favorability of the landscape.
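For concreteness, the sketch below illustrates the objects named in the abstract: the $\ell_2$-regularized squared loss of a two-layer ReLU network of width $m$ on $n$ points in dimension $d$, the activation pattern defining a constant activation region, and the initialization scale distinguishing the large- and small-initialization regimes. This is a minimal illustration written for this page, not the authors' code; all variable names, dimensions, and the choice of squared loss are assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's code): l2-regularized training loss
# for a two-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>).
rng = np.random.default_rng(0)
n, d, m, lam = 20, 5, 200, 1e-3          # data points, input dim, width, weight decay

X = rng.standard_normal((n, d))          # data points (placeholder)
y = rng.standard_normal(n)               # targets (placeholder)

# Initialization scale: e.g. 1.0 for the "large" regime, 1e-3 for the "small"
# (feature learning) regime discussed in the abstract.
init_scale = 1.0
W = init_scale * rng.standard_normal((m, d))   # hidden-layer weights w_j
a = init_scale * rng.standard_normal(m)        # output weights a_j

def regularized_loss(W, a, X, y, lam):
    """Squared loss plus an l2 penalty (weight decay) on all parameters."""
    hidden = np.maximum(X @ W.T, 0.0)    # ReLU activations, shape (n, m)
    preds = hidden @ a                   # network outputs, shape (n,)
    data_term = 0.5 * np.mean((preds - y) ** 2)
    reg_term = 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))
    return data_term + reg_term

# The activation pattern (which neuron/data pairs are active) indexes the
# constant activation region in which the current parameters lie; within such
# a region the network is linear in its parameters and the loss is smooth.
activation_pattern = (X @ W.T > 0)       # boolean array of shape (n, m)

print("regularized loss:", regularized_loss(W, a, X, y, lam))
print("distinct neuron activation patterns:",
      len({tuple(col) for col in activation_pattern.T}))
```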
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jiawei_Zhang6
Submission Number: 7498