Implicit $\ell^1$-regularization of positively quadratically reparameterized linear regression: precise upper and lower bounds
Session: General
Keywords: Non-convex optimization, Deep learning, Implicit bias, Overparameterization
Abstract: Modern neural networks are often trained in a setting where the number of parameters vastly exceeds the number of training samples.
While statistical folklore might suggest overfitting due to the huge capacity of these models, they show remarkable performance in practice, even when no explicit regularization is applied at all.
To explain this phenomenon, it has been conjectured that the training algorithm itself is biased towards models of low capacity by implicitly regularizing the model.
While such an explanation remains elusive for deep neural networks, significant progress has been made for simpler models.
To understand the implicit regularization of gradient flow, diagonal linear neural networks have been studied extensively.
It has been observed that, for a sufficiently small initialization, gradient flow converges to the model of almost minimal $\ell^1$-norm among all models that perfectly interpolate the training data.
In this work, we study positive diagonal linear neural networks of depth $D=2$ in a regression task (a.k.a. quadratically reparameterized linear regression).
We analyze the approximation error between the limit of the gradient flow and the solution of the $\ell^1$-minimization problem.
We derive precise upper and lower bounds on the approximation error as a function of the initialization scale $\alpha$:
the error decays at rate $\alpha^{1-\varrho}$, where $\varrho<1$.
Furthermore, $\varrho$ can be characterized explicitly and is closely related to quantities prominent in the field of compressive sensing. Our upper bounds improve on previous results in the literature, and, to the best of our knowledge, no lower bounds were previously available.
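As a minimal sketch of the setting described above, assuming the standard depth-$2$ positive diagonal parameterization and a least-squares loss from the diagonal linear network literature (the symbols $X$, $y$, $u$, and $w_\alpha^\infty$ below are illustrative and not taken from the paper, and the bound is shown only schematically):
$$
w = u \odot u, \qquad \dot u(t) = -\nabla_u \tfrac{1}{2}\,\bigl\|X\bigl(u(t) \odot u(t)\bigr) - y\bigr\|_2^2, \qquad u(0) = \alpha \mathbf{1},
$$
$$
\bigl\|w_\alpha^\infty - w^{\ell^1}\bigr\| \lesssim \alpha^{1-\varrho}, \qquad w^{\ell^1} \in \arg\min_{\,Xw = y,\; w \ge 0} \|w\|_1 .
$$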
Submission Number: 58