Keywords: Efficient Inference Methods, Quantization, Numerical Formats, Low Precision, Compression
Abstract: We propose a framework for the systematic design and analysis of quantisation formats. Our objective of minimising the KL divergence between the original and quantised model outputs aligns with minimising the squared quantisation error of the model parameters. Guided by classical quantisation theory, we therefore develop and evaluate squared-error-optimal formats for known distributions. We show that uniform quantisation followed by lossless compression with a variable-length code is optimal. However, we find that commonly used block formats and sparse outlier formats also outperform fixed-length codes, implying that they too exploit variable-length encoding. Finally, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when tested with direct-cast quantisation of language models.
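For intuition on why a variable-length code can beat a fixed-length one, the following minimal sketch (not the paper's implementation; it assumes Gaussian-distributed parameters and an arbitrary step size) applies a uniform quantiser and compares the fixed-length cost of indexing the occupied grid with the empirical entropy of the quantised symbols, i.e. the bits per parameter a lossless variable-length code could approach:

```python
# Minimal illustrative sketch, assuming a standard-normal parameter tensor
# and a hand-picked step size; not the paper's code or recommended settings.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)      # stand-in for a parameter tensor

step = 0.25                             # uniform quantisation step size
q = np.round(w / step)                  # integer symbol for each parameter
w_hat = q * step                        # reconstructed parameters

mse = np.mean((w - w_hat) ** 2)         # squared quantisation error

# Fixed-length cost: bits needed to index every occupied grid point.
symbols, counts = np.unique(q, return_counts=True)
fixed_bits = np.ceil(np.log2(len(symbols)))

# Variable-length cost: empirical entropy of the quantised symbols,
# the rate an ideal lossless entropy coder could approach.
p = counts / counts.sum()
entropy_bits = -(p * np.log2(p)).sum()

print(f"MSE             : {mse:.4f}")
print(f"fixed-length    : {fixed_bits:.2f} bits/parameter")
print(f"variable-length : {entropy_bits:.2f} bits/parameter")
```

Under these assumed settings the entropy of the quantised symbols is noticeably lower than the fixed-length index cost, illustrating the gap that variable-length encoding (and, per the abstract, block and sparse-outlier formats) can exploit at the same squared error.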
Submission Number: 69