Keywords: quantization, efficiency
TL;DR: We identify that PTQ robustness is driven by a complex interaction between learning-rate dynamics and validation loss, largely independent of training data size, and we investigate interventions that can favorably affect PTQ robustness.
Abstract: Despite its widespread use, little is understood about what makes large language models more — or less — robust to quantization.
To address this question, we study the degradation induced by quantization in language modeling, analyzing open-source training trajectories of models up to 3 billion parameters and 11 trillion tokens, and validate our analysis by pretraining 160M-parameter models on up to 100B tokens. Our findings reveal that, contrary to previous work, post-training quantization robustness is driven by a complex interplay between learning rate decay and validation loss. In particular, as the learning rate decays, validation loss and quantization error diverge, mostly independently of the amount of training data. Finally, we present two examples of interventions on the training dynamics that modulate quantization error, sometimes favorably: (1) for comparable validation loss, higher learning rates can lead to smaller quantization error; and (2) weight averaging can favorably approximate learning rate decay in some settings.
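The central quantity in the abstract is the quantization error, i.e. the gap between the validation loss of a post-training-quantized model and that of its full-precision counterpart. Below is a minimal sketch of how such a measurement could look, assuming a simple round-to-nearest weight quantizer and a standard PyTorch evaluation loop; the function names, bit-width, and quantization scheme are illustrative assumptions, not the paper's exact protocol.

```python
import copy
import torch
import torch.nn as nn


def quantize_rtn_(linear: nn.Linear, n_bits: int = 4) -> None:
    """In-place round-to-nearest, per-output-channel symmetric weight quantization."""
    w = linear.weight.data
    # One scale per output channel, chosen so the channel's max magnitude fits the int grid.
    scale = w.abs().amax(dim=1, keepdim=True) / (2 ** (n_bits - 1) - 1)
    scale = scale.clamp(min=1e-8)
    q = (w / scale).round().clamp(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    linear.weight.data = q * scale  # dequantized ("fake-quantized") weights


@torch.no_grad()
def quantization_error(model: nn.Module, val_loader, loss_fn, n_bits: int = 4) -> float:
    """Quantization error = quantized validation loss - full-precision validation loss."""

    def eval_loss(m: nn.Module) -> float:
        m.eval()
        total, count = 0.0, 0
        for inputs, targets in val_loader:
            # Assumes loss_fn returns a per-batch mean loss.
            total += loss_fn(m(inputs), targets).item() * len(targets)
            count += len(targets)
        return total / count

    fp_loss = eval_loss(model)

    # Quantize a copy of the model so the original checkpoint is left untouched.
    q_model = copy.deepcopy(model)
    for module in q_model.modules():
        if isinstance(module, nn.Linear):
            quantize_rtn_(module, n_bits)

    return eval_loss(q_model) - fp_loss
```

Tracking this gap alongside validation loss over a training trajectory (e.g. across checkpoints taken before and after learning rate decay) is one way to observe the divergence the abstract describes.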
Submission Number: 178