Abstract: Although recent quantized Large Language Models, such as BitNet, have paved the way for
significant reductions in memory usage during deployment with binary or ternary weights,
training these models still demands a substantial memory footprint. This is partly because
high-precision (i.e., unquantized) weights required for straight-through estimation must be
maintained throughout the whole training process. To address this, we explore directly
updating the quantized low-precision weights without relying on straight-through estimation
during backpropagation, aiming to reduce memory usage during training. Specifically,
we employ a stochastic rounding technique to minimize the information loss caused by the
use of low-bit weights throughout training. Experimental results on our LLaMA-structured
models of various sizes indicate that (1) training with only low-precision weights is feasible
even when they are constrained to ternary values; (2) extending the bit width to 8 bits
achieves performance on par with BitNet b1.58; (3) our models remain robust to precision
scaling and memory reduction, showing minimal performance degradation when moving
from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support
inference using ternary weights, showcasing their flexibility in deployment.
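The abstract does not spell out the exact update rule, but stochastic rounding itself is a standard technique: each value is rounded to a neighbouring grid point, with the upper neighbour chosen with probability equal to the fractional part, so the rounded result is unbiased in expectation. The sketch below (Python/NumPy) illustrates this general idea for ternary weights; the helper names and the BitNet-style mean-absolute-value scale are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each element of x to floor(x) or floor(x)+1, picking the upper
    neighbour with probability equal to the fractional part. The result
    equals x in expectation, which limits information loss when only
    low-bit weights are kept."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(x.shape) < frac)

def quantize_ternary(w, rng):
    """Hypothetical ternary quantizer: scale by the mean absolute value
    (a BitNet-style choice), stochastically round, then clip to {-1, 0, +1}.
    The scale s would be stored alongside the quantized weights."""
    s = np.mean(np.abs(w)) + 1e-8
    q = stochastic_round(w / s, rng)
    return np.clip(q, -1, 1), s

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 4)).astype(np.float32)
q, s = quantize_ternary(w, rng)
print(q)       # entries in {-1, 0, +1}
print(q * s)   # dequantized approximation of w
```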
Supplementary Material: pdf
Submission Number: 278