Regression Descent: A Statistical Framework for Neural Network Optimization

Published: 03 Feb 2026, Last Modified: 03 Feb 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: Instead of taking a gradient step, we solve a lower-dimensional regression problem at every step, focusing updates on the Jacobian's row space to better capture the local geometry.
Abstract: We present Regression Descent (RD), a novel optimization algorithm for training deep neural networks that reformulates each gradient step as a regression problem in the span of the Jacobian. By leveraging the implicit function theorem in over-parameterized settings where the number of parameters exceeds the number of observations $(p > n)$, we project the optimization onto an $n$-dimensional subspace, enabling the use of statistical techniques and offering potentially improved conditioning. Our key insight is that in the over-parameterized regime, meaningful parameter updates lie in the row space of the Jacobian matrix, allowing us to solve a lower-dimensional regression problem with explicit regularization control. We establish convergence guarantees for RD under standard smoothness assumptions, showing that it achieves a convergence rate of $O(1/k)$ for smooth non-convex objectives. Furthermore, we prove that RD exhibits local linear convergence in neighborhoods of strict local minima, with the convergence rate dependent on the condition number of the regularized Gram matrix. The algorithm naturally handles the ill-conditioning common in neural network optimization through adaptive regularization and extends seamlessly to multi-output problems and mini-batch settings. Experimental results on Lorenz96, MNIST, and FMNIST datasets demonstrate that RD achieves up to 40\% faster convergence compared to SGD and Adam in terms of wall-clock time, with strong performance in the presence of activation function saturation. The computational overhead of solving $m \times m$ linear systems (where $m$ is the batch size) is offset by improved convergence properties and GPU-efficient operations. Our work opens new avenues for understanding neural network optimization through the lens of statistical regression, providing a practical algorithm for scenarios where standard gradient methods struggle.
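The abstract's core idea, restricting the update to the Jacobian's row space by solving a regularized $m \times m$ Gram system per step, can be sketched as follows. This is a minimal illustration based only on the abstract, not the authors' implementation; the function name `rd_step` and the parameters `lam` (regularization) and `lr` (step size) are assumptions for this sketch.

```python
import numpy as np

def rd_step(theta, jacobian, residual, lam=1e-3, lr=1.0):
    """One Regression Descent-style step (illustrative sketch).

    theta    : parameter vector, shape (p,)
    jacobian : callable returning the batch Jacobian, shape (m, p)
    residual : callable returning the batch residuals, shape (m,)
    """
    J = jacobian(theta)                  # (m, p), m = batch size, p > m assumed
    r = residual(theta)                  # (m,)
    G = J @ J.T                          # m x m Gram matrix
    # Regularized regression in the row space of J: (G + lam*I) alpha = r
    alpha = np.linalg.solve(G + lam * np.eye(len(r)), r)
    # Update lies in the row space of J, as the abstract describes
    return theta - lr * (J.T @ alpha)

# Toy over-parameterized least squares: n = 5 observations, p = 20 parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)
theta = np.zeros(20)
theta_next = rd_step(theta, lambda t: X, lambda t: X @ t - y)
```

For this linear toy problem a single step with `lr=1.0` nearly annihilates the residual, since it amounts to a ridge-regularized projection onto the data; for nonlinear networks the Jacobian and residual would be recomputed each step.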
Submission Number: 1767