On the Dynamics of Learning Linear Functions with Neural Networks

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: non-convex optimization, feature learning, global convergence, ReLU networks
Abstract: This paper studies the gradient descent training dynamics of fitting a one-hidden-layer ReLU network with multi-dimensional outputs to linear target functions. Specifically, we focus on a realizable model in which the inputs are drawn i.i.d. from a Gaussian distribution and the labels are generated by a planted linear model with multiple outputs. This framework models a variety of interesting problems, including end-to-end training in inverse problems and various auto-encoder architectures in machine learning. Despite the seemingly simple formulation, understanding the training dynamics is a challenging open problem. This is in part because the training landscape contains multiple non-strict saddle points, and it is a priori unclear why gradient descent from random initialization escapes such bad stationary points. In this work, we develop the first comprehensive analysis of the gradient descent dynamics for learning linear target functions with ReLU networks. We show that gradient descent with a moderately small random initialization converges to a global minimizer at a linear rate. To rigorously show that gradient descent avoids non-strict saddle points, we develop intricate techniques to decompose the loss and control the gradient descent trajectory, which may have broader implications for the analysis of non-convex optimization problems involving non-strict saddles. We corroborate our theoretical results with extensive experiments across various configurations.
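The setting described in the abstract can be sketched in a few lines of NumPy: inputs drawn i.i.d. from a standard Gaussian, multi-output labels produced by a planted linear map, and a one-hidden-layer ReLU network trained with full-batch gradient descent from a small random initialization. This is a minimal illustration, not the authors' code; all dimensions, the step size, and the initialization scale below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m, n = 10, 5, 50, 2000   # input dim, output dim, hidden width, samples

A = rng.standard_normal((k, d))   # planted linear target (assumed Gaussian here)
X = rng.standard_normal((n, d))   # i.i.d. Gaussian inputs
Y = X @ A.T                       # multi-output linear labels

init_scale = 0.1                  # moderately small random initialization
W1 = init_scale * rng.standard_normal((m, d))
W2 = init_scale * rng.standard_normal((k, m))

lr = 0.05                         # illustrative step size
losses = []
for _ in range(1500):
    H = X @ W1.T                  # pre-activations, shape (n, m)
    Z = np.maximum(H, 0.0)        # ReLU features
    P = Z @ W2.T                  # network outputs, shape (n, k)
    R = P - Y                     # residuals
    losses.append(0.5 * np.mean(np.sum(R**2, axis=1)))
    # full-batch gradients of the average squared loss
    G2 = (R.T @ Z) / n                      # dL/dW2
    G1 = (((R @ W2) * (H > 0)).T @ X) / n   # dL/dW1 (ReLU mask on H > 0)
    W2 -= lr * G2
    W1 -= lr * G1

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

With these (hypothetical) hyperparameters, the training loss decreases steadily over the run, consistent with the global convergence behavior the paper analyzes; the sketch makes no claim about rates or saddle avoidance, which are the subject of the theory.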
Primary Area: optimization
Submission Number: 14271