Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

ICLR 2026 Conference Submission 22708 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: scaling laws, signSGD, SGD, compute-optimal curves, power-law random feature, stable-decay schedule
TL;DR: signSGD sharpens the compute-optimal scaling law in the PLRF model in the noise-bottleneck regime.
Abstract: We study scaling laws of signSGD under a power-law random features (PLRF) model that accounts for both feature and target decay. We analyze the expected population risk of a linear model trained with one-pass signSGD on Gaussian-sketched features. We express the risk as a function of model size, training steps, learning rate, and the feature and target decay parameters. Comparing against the SGD risk analyzed by Paquette et al. (2024), we identify a drift-normalization effect and a noise-reshaping effect unique to signSGD. We then obtain compute-optimal scaling laws under the optimal choice of learning rate. Our analysis shows that the noise-reshaping effect can make the compute-optimal slope of signSGD steeper than that of SGD in regimes where noise is dominant. Finally, we observe that a stable-decay schedule, a simplified variant of the widely used warmup-stable-decay (WSD) schedule, further reduces the noise term and sharpens the compute-optimal slope when feature decay is fast but target decay is slow.
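
To make the setup concrete, the sketch below simulates one-pass signSGD on a Gaussian-sketched power-law random features model. It is a minimal illustration under assumed choices: the exponents alpha and beta, the dimensions v and d, the sketch normalization, the learning rate, and the noiseless target are all placeholders, not the paper's exact construction or analysis.

```python
import numpy as np

# Hedged sketch: one-pass (streaming) signSGD on a power-law random
# features (PLRF) model. All constants below are illustrative assumptions.

rng = np.random.default_rng(0)

v, d = 2000, 200          # ambient feature dimension v, model (sketch) size d
alpha, beta = 1.0, 0.5    # assumed feature-decay and target-decay exponents
n_steps, lr = 5000, 1e-3  # training steps and (constant) learning rate

# Power-law feature spectrum and power-law target coefficients.
eigs = np.arange(1, v + 1, dtype=float) ** (-alpha)   # feature decay
b = np.arange(1, v + 1, dtype=float) ** (-beta)       # target decay

# Gaussian sketch mapping the v ambient features to d trainable features.
W = rng.normal(size=(v, d)) / np.sqrt(v)

theta = np.zeros(d)  # linear model parameters

for t in range(n_steps):
    # Fresh sample each step (one-pass / online regime).
    x = rng.normal(size=v) * np.sqrt(eigs)  # features with decaying covariance
    y = b @ x                               # noiseless power-law target
    phi = W.T @ x                           # sketched features seen by the model
    resid = phi @ theta - y
    grad = resid * phi                      # stochastic gradient of squared loss
    theta -= lr * np.sign(grad)             # signSGD: coordinate-wise sign update

# Monte-Carlo estimate of the population risk of the trained linear model.
X = rng.normal(size=(10000, v)) * np.sqrt(eigs)
risk = np.mean((X @ W @ theta - X @ b) ** 2)
print(f"estimated population risk: {risk:.4f}")
```

Sweeping the model size d and the number of steps under a tuned learning rate (or a stable-decay schedule in place of the constant rate above) would trace out the compute-optimal curves the abstract refers to.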
Primary Area: learning theory
Submission Number: 22708