Regression Descent: A Statistical Framework for Neural Network Optimization
TL;DR: Instead of taking a gradient step, we solve a lower-dimensional regression problem at every step, focusing updates on the Jacobian's row space to better capture the local geometry.
Abstract: We present Regression Descent (RD), a novel optimization algorithm for training deep neural networks that reformulates each gradient step as a regression problem in the span of the Jacobian. By leveraging the implicit function theorem in over-parameterized settings where the number of parameters exceeds the number of observations $(p > n)$, we project the optimization onto an $n$-dimensional subspace, enabling the use of statistical techniques and potentially improved conditioning. Our key insight is that in the over-parameterized regime, meaningful parameter updates lie in the row space of the Jacobian matrix, allowing us to solve a lower-dimensional regression problem with explicit regularization control. We establish convergence guarantees for RD under standard smoothness assumptions, showing that it achieves a convergence rate of $O(1/k)$ for smooth non-convex objectives. The algorithm naturally handles the ill-conditioning common in neural network optimization through adaptive regularization and extends seamlessly to multi-output problems and mini-batch settings. Experimental results on Lorenz96, MNIST, and FMNIST datasets demonstrate that RD achieves up to 40\% faster convergence compared to SGD and Adam in terms of wall-clock time, with strong performance in the presence of activation function saturation. The computational overhead of solving $m \times m$ linear systems (where $m$ is the batch size) is offset by improved convergence properties and GPU-efficient operations. Our work opens new avenues for understanding neural network optimization through the lens of statistical regression, providing a practical algorithm for scenarios where standard gradient methods struggle.
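To make the abstract's description concrete, the following is a minimal NumPy sketch of what one RD-style step could look like, under the assumptions stated in the abstract: the update is constrained to the row space of the batch Jacobian, and each step solves a regularized $m \times m$ linear system ($m$ = batch size). The function names, the ridge-style regularizer `lam`, and the step size `lr` are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def regression_descent_step(theta, jacobian_fn, residual_fn, lam=1e-3, lr=1.0):
    """One hypothetical Regression Descent (RD) step.

    Rather than a raw gradient step, solve a small m x m regularized
    regression in the row space of the batch Jacobian, then map the
    solution back to parameter space via J^T. Illustrative only.
    """
    J = jacobian_fn(theta)   # (m, p) Jacobian of the batch residuals
    r = residual_fn(theta)   # (m,)   current batch residuals
    m = J.shape[0]
    # Regularized m x m system: (J J^T + lam * I) alpha = r
    G = J @ J.T + lam * np.eye(m)
    alpha = np.linalg.solve(G, r)
    # The update lies in the row space of J: delta = J^T alpha
    return theta - lr * (J.T @ alpha)

# Toy over-parameterized least-squares problem (p > n): r(theta) = X theta - y
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))   # m = n = 8 observations, p = 50 parameters
y = rng.normal(size=8)
theta = np.zeros(50)
for _ in range(20):
    theta = regression_descent_step(
        theta,
        jacobian_fn=lambda t: X,          # Jacobian is constant here
        residual_fn=lambda t: X @ t - y,
        lam=1e-6,
    )
```

On this linear toy problem a single step with small `lam` essentially recovers the minimum-norm interpolating solution, since $X^\top (X X^\top)^{-1} y$ solves $X\theta = y$ exactly; for a nonlinear network the Jacobian changes each step and the regularizer controls conditioning.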
Submission Number: 1767