Convergence of Near-Linear Width ReLU Networks with Unbalanced Initialization

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: overparameterization, ReLU activation, gradient descent, Nesterov's acceleration
Abstract: The optimization of neural networks is fundamental to machine learning. While the conjecture that linear width suffices for convergence has been confirmed in some restrictive settings, a significant gap remains for non-smooth ReLU networks, for which prior works require substantially wider, polynomial-width networks. We significantly narrow this gap by developing an analysis that simultaneously achieves near-linear width and accelerated convergence for two-layer ReLU networks with a shared first layer and vector-valued outputs. Our results are enabled by a novel unbalanced Gaussian initialization that tightly controls the kernel shift for the non-smooth ReLU activation. We prove that gradient descent (GD) achieves linear convergence for networks with only $\tilde{\Omega}(Nn/\lambda)$ neurons, where $N$ is the sample size, $n$ is the output dimension, and $\lambda$ denotes the smallest eigenvalue of the (limiting) neural tangent kernel (NTK), a quantity standard in prior analyses operating in the NTK regime. Within the same framework, Nesterov's accelerated gradient (NAG) attains a provable speedup without sacrificing near-linear width, improving the iteration complexity from $O(n\kappa\log\frac{1}{\epsilon})$ to $O(\sqrt{n\kappa}\log\frac{1}{\epsilon})$, where $\kappa$ is the NTK condition number. Finally, our analysis establishes low-rank adaptivity: by introducing a sketching step at initialization and a subspace analysis, the width requirement reduces to $\tilde{\Omega}(Nr/\lambda)$ for responses of rank $r \ll n$. By tackling the key analytical hurdles of non-smoothness and vector-valued outputs with a shared first layer, our work substantially tightens the width required for provable convergence in ReLU networks and brings theory closer to long-standing conjectures.
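
For illustration, a minimal instance of the setting sketched in the abstract is the two-layer model below; the specific variance scales $\nu_1, \nu_2$ of the unbalanced Gaussian initialization (and which layer receives the smaller scale) are assumptions made here for concreteness, not the paper's exact construction:
\[
  f(x; W, V) \;=\; \tfrac{1}{\sqrt{m}}\, V\, \sigma(W x), \qquad \sigma(z) = \max(z, 0) \ \text{entrywise},
\]
\[
  W \in \mathbb{R}^{m \times d} \ \text{(shared first layer)}, \qquad V \in \mathbb{R}^{n \times m} \ \text{(vector-valued output)},
\]
\[
  W_{ij} \sim \mathcal{N}(0, \nu_1^2), \qquad V_{kj} \sim \mathcal{N}(0, \nu_2^2), \qquad \nu_2 \ll \nu_1 \ \text{(unbalanced)},
\]
with width $m = \tilde{\Omega}(Nn/\lambda)$ in the GD and NAG results, reduced to $m = \tilde{\Omega}(Nr/\lambda)$ when the responses have rank $r \ll n$ and a sketching step is applied at initialization.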
Primary Area: optimization
Submission Number: 23247