Keywords: SignSGD, Scaling Law, Phase Diagram, Data Scaling Efficiency
Abstract: Despite their widespread use in deep learning, the mechanisms underlying the effectiveness of adaptive gradient methods in large-scale training remain poorly understood. In this work, we provide a scaling-law analysis of SignSGD, a minimal yet expressive optimizer that captures the core coordinate-wise adaptivity shared by more sophisticated adaptive methods. We consider feature-space linear regression with power-law spectra, which allows us to precisely characterize the training dynamics of SignSGD. Specifically, we derive explicit scaling laws for SignSGD that accurately describe the loss dynamics. By further analyzing the data-limited regime, we characterize the phase diagram of SignSGD training and quantify the superiority of SignSGD in data scaling. We also show that SignSGD admits a substantially larger critical batch size than SGD, allowing SignSGD to benefit more from large-batch training. Finally, we systematically validate our theoretical predictions through large-scale LLM pre-training experiments, demonstrating that the scaling laws uncovered here extend beyond the controlled setting and are predictive of practical training behavior.
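For readers unfamiliar with the optimizer, the update the abstract refers to is coordinate-wise sign descent on minibatch gradients. Below is a minimal illustrative sketch of SignSGD on the power-law linear-regression setting the abstract describes; it is not the paper's code, and all hyperparameters (d, a, eta, batch_size, steps) are hypothetical choices for illustration only.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's implementation):
# SignSGD on linear regression whose feature covariance is diagonal
# with a power-law spectrum lambda_i ~ i^(-a).
rng = np.random.default_rng(0)
d, a = 512, 1.5                                    # dimension, spectral decay exponent
eigs = np.arange(1, d + 1, dtype=float) ** (-a)    # power-law eigenvalues
w_star = rng.normal(size=d)                        # ground-truth weights
w = np.zeros(d)                                    # parameters to learn
eta, batch_size, steps = 1e-3, 64, 2000            # illustrative hyperparameters

for t in range(steps):
    # Sample features with the power-law covariance (diagonal for simplicity).
    x = rng.normal(size=(batch_size, d)) * np.sqrt(eigs)
    y = x @ w_star                                 # noiseless targets
    grad = x.T @ (x @ w - y) / batch_size          # minibatch gradient of 0.5 * MSE
    w -= eta * np.sign(grad)                       # SignSGD: step on the gradient sign only
```

Unlike SGD, the step size per coordinate is fixed at eta regardless of gradient magnitude, which is the coordinate-wise adaptivity the abstract highlights.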
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 99