Stochastic Gradient Descent on the Linear Bigram Model: Bias-Variance Scaling and Critical Batch Size

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: stochastic gradient descent, critical batch size, bias-variance decomposition, linear bigram model, Zipf's law, scaling laws
Abstract: The critical batch size, beyond which data parallelism gives diminishing returns, is a key factor in language-model pretraining. Existing finite-time least-squares theory provides useful templates for predicting it, but does not directly give finite-vocabulary Zipf rates with explicit dependence on the vocabulary size. The linear bigram model, which fits a next-token transition matrix under squared loss with power-law token frequencies, provides a tractable setting for this regime. Its one-hot sampling structure also departs from prior vector analyses, and so far it has only been studied in the deterministic full-batch case. We give a finite-time analysis of mini-batch SGD on this model under power-law token distributions. Our main result is an exact, closed-form bias-variance decomposition of the expected loss, in which the bias term equals the loss of deterministic gradient descent and the variance term captures the cost of mini-batch noise. From this decomposition we obtain scaling laws for the bias and the variance, governed by a frequency cutoff that separates rows with enough effective updates from rarer under-trained rows. When this cutoff reaches the full vocabulary, the learning curve changes phase. Balancing bias and variance yields the scaling of the critical batch size. We confirm the predicted scaling on simulated bigrams and on bigram statistics estimated from OpenWebText.
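The setting described in the abstract can be illustrated with a minimal simulation sketch. It is not the authors' code or experimental setup: the function names (`zipf_probs`, `excess_loss`, `sgd_bigram`), the Dirichlet-sampled ground-truth transition matrix, the zero initialization, and all default hyperparameters are assumptions chosen only for illustration. The sketch draws a current token from a Zipf distribution, draws the next token from the corresponding row of a fixed transition matrix, and runs mini-batch SGD on the squared loss between the one-hot next token and the matching row of the learned matrix.

```python
import numpy as np

def zipf_probs(vocab_size, alpha):
    """Power-law token frequencies p_i proportional to i^(-alpha)."""
    ranks = np.arange(1, vocab_size + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

def excess_loss(W, P_star, p):
    """Population squared loss minus its irreducible part:
    0.5 * sum_i p_i * ||W[i] - P_star[i]||^2."""
    return 0.5 * np.sum(p[:, None] * (W - P_star) ** 2)

def sgd_bigram(vocab_size=50, alpha=1.0, batch_size=64, lr=0.5,
               steps=500, seed=0):
    """Mini-batch SGD on a linear bigram model (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    p = zipf_probs(vocab_size, alpha)
    # Assumed ground truth: each row is a random conditional distribution.
    P_star = rng.dirichlet(np.ones(vocab_size), size=vocab_size)
    cdf = np.cumsum(P_star, axis=1)
    W = np.zeros((vocab_size, vocab_size))  # assumed zero initialization
    losses = [excess_loss(W, P_star, p)]
    for _ in range(steps):
        # Current tokens follow the Zipf law.
        x = rng.choice(vocab_size, size=batch_size, p=p)
        # Next tokens follow row x of P_star (inverse-CDF sampling).
        u = rng.random(batch_size)
        y = np.minimum((u[:, None] > cdf[x]).sum(axis=1), vocab_size - 1)
        # Gradient of 0.5 * ||e_y - W[x]||^2 touches only row x (one-hot input).
        grad = np.zeros_like(W)
        np.add.at(grad, x, W[x])        # accumulate W[x_b] into row x_b
        np.add.at(grad, (x, y), -1.0)   # subtract the one-hot target e_{y_b}
        W -= lr * grad / batch_size
        losses.append(excess_loss(W, P_star, p))
    return np.array(losses)
```

Calling `sgd_bigram(batch_size=4)` and `sgd_bigram(batch_size=256)` with the same step size gives a rough feel for how the noise floor shrinks as the batch grows, while rows for rare tokens receive few updates at any batch size. The sketch does not compute the closed-form decomposition or the critical batch size from the paper.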
Submission Number: 163