Keywords: M+Adam, low-precision training, mixed-precision optimization, mantissa-exponent factorization, multiplicative updates, optimizer stability, Adam, Madam
Abstract: Training large language models (LLMs) in full precision (FP32) is increasingly constrained by memory, compute, and energy demands. Low-precision formats such as BF16, for which modern accelerators are optimized, offer substantial gains: reduced memory footprint, higher throughput, and lower energy consumption. However, training performed entirely in low precision, without FP32 master weights or optimizer states, typically underperforms full-precision training. Standard additive optimizers such as Adam often diverge in this regime, as small updates vanish below the mantissa resolution while large ones overflow the representable range.
We introduce M+Adam, an optimizer that enables stable, fully low-precision training by jointly applying additive and multiplicative updates. Each weight is represented as a mantissa–exponent pair, where Adam refines the mantissa and Madam adjusts the exponent. This dual-path update aligns the optimizer dynamics with floating-point structure: additive updates provide fine intra-bin control, while multiplicative updates traverse quantization bins.
Theoretically, we prove monotone descent under standard smoothness assumptions. Empirically, M+Adam trains LLaMA-style models in pure BF16 (no FP32 copies) and matches the perplexity of full-precision Adam across 60M–350M parameter scales, providing a practical step toward end-to-end low-precision optimization.
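The sketch below illustrates, in NumPy, one way the dual-path update described in the abstract could be organized: an Adam-style additive step refines the mantissa, while a Madam-style multiplicative step rescales the magnitude and thereby shifts the exponent. The class name, hyperparameters (lr_add, lr_mul), and exact update rules are assumptions for illustration only and are not taken from the paper; the actual M+Adam algorithm, its BF16 handling, and its theoretical guarantees may differ.

import numpy as np

class MPlusAdamSketch:
    """Hypothetical dual-path (additive + multiplicative) update sketch."""

    def __init__(self, shape, lr_add=1e-3, lr_mul=1e-2,
                 betas=(0.9, 0.999), eps=1e-8):
        self.lr_add, self.lr_mul = lr_add, lr_mul   # assumed step sizes
        self.b1, self.b2 = betas
        self.eps = eps
        self.m = np.zeros(shape)   # Adam first moment
        self.v = np.zeros(shape)   # Adam second moment
        self.t = 0

    def step(self, w, grad):
        self.t += 1
        # Adam-style moment estimates, shared by both update paths.
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        adam_dir = m_hat / (np.sqrt(v_hat) + self.eps)

        # Split each weight into mantissa and exponent: w = mantissa * 2**exponent.
        mantissa, exponent = np.frexp(w)

        # Additive path: Adam-style step on the mantissa (fine intra-bin control).
        mantissa = mantissa - self.lr_add * adam_dir

        # Multiplicative path: a Madam-style factor rescales the magnitude,
        # which moves the weight across exponent (quantization) bins.
        scale = np.exp(-self.lr_mul * np.sign(w) * adam_dir)

        # Recombine mantissa and exponent, then apply the multiplicative factor.
        return np.ldexp(mantissa, exponent) * scale

In this sketch the multiplicative factor is applied to the recombined weight rather than to the stored exponent bits, and all arithmetic is FP64 for clarity; a pure-BF16 implementation as described in the abstract would operate directly on the low-precision representation.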
Submission Number: 141