Keywords: Optimizer-state quantization; quantized EMA dynamics; memory-efficient training; subtractive dithering.
Abstract: Low-bit storage of optimizer states can reduce the memory footprint of large-scale language-model training, but for states updated by exponential moving averages (EMAs), it also changes the optimizer dynamics. At every step, the stored state is dequantized, updated, requantized, and written back, so quantization errors are recursively fed into future optimizer states.
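The read, update, requantize, write-back loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the fixed uniform grid `scale`, the synthetic gradients, and the helper names `quantize_nearest` and `ema_with_storage` are all assumptions made for the example.

```python
import numpy as np

def quantize_nearest(x, scale):
    """Round to the nearest multiple of `scale` (assumed uniform grid, no clipping)."""
    return np.round(x / scale) * scale

def ema_with_storage(grads, beta, store):
    """Run an EMA whose state passes through `store` after every update."""
    state = np.zeros_like(grads[0])
    for g in grads:
        updated = beta * state + (1.0 - beta) * g  # update the dequantized state
        state = store(updated)                     # requantize and write back
    return state

rng = np.random.default_rng(0)
steps, d, beta, scale = 500, 1024, 0.99, 0.1
# Synthetic gradients with mean 0.3 (an assumption for illustration only).
grads = 0.3 + 0.05 * rng.standard_normal((steps, d))

exact = ema_with_storage(grads, beta, lambda x: x)
stored = ema_with_storage(grads, beta, lambda x: quantize_nearest(x, scale))

print("full-precision EMA mean:", exact.mean())
print("nearest-rounded EMA mean:", stored.mean())
```

In this setup each EMA increment is roughly `(1 - beta) * 0.3 = 0.003`, far below half a grid step (`0.05`), so round-to-nearest discards every update and the stored state never leaves zero, while the full-precision EMA approaches 0.3. The error is not noise that averages out; it is a systematic bias that is fed back into the recursion at every step.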
We analyze the quantized EMA recursion and identify a failure mode: biased quantization errors accumulate into persistent drift, error variance is amplified, and the aggregate state error scales with state dimensionality.
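A simplified model makes these three effects concrete. The notation here is an assumption for illustration and may differ from the paper's analysis: $m_t$ is the full-precision EMA with decay $\beta$, $\hat m_t$ the stored state, and $\epsilon_t$ the per-coordinate quantization error introduced at step $t$, with $e_0 = 0$.

$$
\begin{aligned}
m_t &= \beta\, m_{t-1} + (1-\beta)\, g_t, \\
\hat m_t &= \beta\, \hat m_{t-1} + (1-\beta)\, g_t + \epsilon_t, \\
e_t &= \hat m_t - m_t = \beta\, e_{t-1} + \epsilon_t = \sum_{s=1}^{t} \beta^{\,t-s}\, \epsilon_s .
\end{aligned}
$$

If the errors share a constant bias $\mathbb{E}[\epsilon_s] = b$, then $\mathbb{E}[e_t] = b\,\frac{1-\beta^{t}}{1-\beta} \to \frac{b}{1-\beta}$, so a decay of $\beta = 0.999$ amplifies the bias by a factor of $1000$. If the errors are instead zero-mean and uncorrelated with variance $\sigma^2$, then $\operatorname{Var}[e_t] = \sigma^2\,\frac{1-\beta^{2t}}{1-\beta^{2}} \to \frac{\sigma^2}{1-\beta^{2}}$. With identical statistics in each of $d$ coordinates, $\mathbb{E}\,\lVert e_t \rVert^2 = d\,\big(\mathbb{E}[e_t]^2 + \operatorname{Var}[e_t]\big)$, which grows linearly in $d$.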
We propose **BReD** (**B**lock **Re**play **D**ithering), which adapts classical subtractive dithering to optimizer-state storage using deterministic seed replay and block-wise shared dither values. BReD leaves optimizer updates and hyperparameters unchanged, stores no auxiliary random tensors, and reduces random-number generation from $O(d)$ to $O(d/B)$ for state dimension $d$ and block size $B$. In controlled 120M-parameter pretraining experiments across five optimizers, 4-bit BReD remains stable where round-to-nearest and stochastic rounding diverge or degrade. In AdamW and Muon pretraining at 1.1B and 3.4B parameters, 4-bit BReD closely tracks full-precision optimizer-state baselines: validation perplexity shifts by at most $0.3$, average zero-shot accuracy changes by at most $0.5$ points, and optimizer-state memory is reduced by up to $86.0\%$ in the evaluated settings. These results suggest that stable low-bit EMA-state storage is a practical option for memory-constrained scaling.
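The three ingredients named above (subtractive dithering, seed replay, and one shared dither value per block) can be sketched as follows. This is a minimal reading of the abstract, not the authors' code: the helper names `block_dither`, `write_state`, and `read_state`, the fixed grid `scale`, the `int8` container without 4-bit packing or clipping, and the synthetic gradients are all assumptions.

```python
import numpy as np

def block_dither(seed, step, num_blocks, block_size, scale):
    """Regenerate one dither value per block from (seed, step); nothing is stored."""
    rng = np.random.default_rng([seed, step])
    u = rng.uniform(-0.5 * scale, 0.5 * scale, size=num_blocks)  # d / B draws
    return np.repeat(u, block_size)  # share each value across its block

def write_state(x, seed, step, block_size, scale):
    """Add the dither, round to the grid, and keep only integer codes."""
    u = block_dither(seed, step, x.size // block_size, block_size, scale)
    return np.round((x + u) / scale).astype(np.int8)

def read_state(codes, seed, step, block_size, scale):
    """Dequantize, then subtract the replayed dither used at write time."""
    u = block_dither(seed, step, codes.size // block_size, block_size, scale)
    return codes.astype(np.float64) * scale - u

rng = np.random.default_rng(0)
steps, d, block, beta, scale, seed = 200, 4096, 16, 0.9, 0.1, 1234
grads = 0.3 + 0.05 * rng.standard_normal((steps, d))  # synthetic gradients

exact = np.zeros(d)
codes = write_state(np.zeros(d), seed, 0, block, scale)
for t, g in enumerate(grads, start=1):
    exact = beta * exact + (1.0 - beta) * g
    state = read_state(codes, seed, t - 1, block, scale)  # replay previous dither
    state = beta * state + (1.0 - beta) * g               # unchanged EMA update
    codes = write_state(state, seed, t, block, scale)     # fresh dither per step
final = read_state(codes, seed, steps, block, scale)

print("full-precision EMA mean:", exact.mean())
print("dithered 'stored' EMA mean:", final.mean())
```

Because the dither spans exactly one grid step and is subtracted again on read, each write contributes a zero-mean error bounded by `scale / 2` regardless of the value being stored, which is the classical property of subtractive dithering. Only the integer codes and the scalar pair `(seed, step)` need to persist between steps; the per-coordinate error is still noisy, so the sketch compares means rather than individual entries.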
Submission Number: 76