Keywords: Parallel training, gradient descent, stochastic optimization
Abstract: This paper addresses the fundamental challenge of accelerating the inherently autoregressive nature of gradient descent (GD) methods such as SGD and Adam from a dynamical-systems perspective.
Specifically, we introduce a unified framework that recasts the autoregressive GD process as solving a system of triangular nonlinear equations (TNEs), thereby enabling a paradigm shift toward non-autoregressive GD with parallel gradient computation across iteration steps. Within this generic framework, we establish that: (1) the TNE system admits a unique solution corresponding precisely to the autoregressive GD iterative trajectory; (2) solving the TNE system guarantees convergence to the GD iterative trajectory in an equal or far smaller number of iterations.
Building on these insights, we present \textit{PASO}, a step-parallel optimizer that accelerates a broad class of autoregressive GD optimizers such as SGD and Adam.
Extensive experiments (\textit{e.g.}, on Llama-3.2-1B) validate that PASO achieves up to a \textbf{91}$\times$ reduction in GD steps and a \textbf{7.5}$\times$ speedup in wall-clock time, with no measurable loss in model quality. The source code will be released publicly.
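To make the core idea concrete, the following is a minimal illustrative sketch (not the paper's PASO implementation) of treating a GD trajectory $x_{t+1} = x_t - \eta \nabla f(x_t)$, $t = 0,\dots,T-1$, as a triangular nonlinear system and solving it with Jacobi-style fixed-point sweeps, where all gradients within a sweep can be evaluated in parallel across steps; the objective, the function `parallel_gd_trajectory`, and all parameter names are hypothetical placeholders.

```python
import numpy as np

def grad(x):
    # Toy quadratic objective f(x) = 0.5 * ||x||^2 (placeholder gradient oracle).
    return x

def parallel_gd_trajectory(x0, eta=0.1, T=50, max_sweeps=50, tol=1e-10):
    """Solve the triangular system x_{t+1} = x_t - eta * grad(x_t) for the
    whole trajectory at once via Jacobi-style fixed-point sweeps."""
    X = np.tile(x0, (T + 1, 1))  # initial guess: entire trajectory at x0
    for sweep in range(max_sweeps):
        # Gradients at all current trajectory points; parallelizable across steps.
        G = np.stack([grad(X[t]) for t in range(T)])
        X_new = X.copy()
        X_new[1:] = X[:-1] - eta * G  # enforce every GD step simultaneously
        if np.max(np.abs(X_new - X)) < tol:
            return X_new, sweep + 1
        X = X_new
    return X, max_sweeps

# The fixed point matches sequential GD exactly (the system is triangular,
# so at most T sweeps are needed), but convergence is often much faster.
x0 = np.array([1.0, -2.0])
X_par, sweeps = parallel_gd_trajectory(x0)
```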
Primary Area: optimization
Submission Number: 4074