Keywords: quantization, large language model
TL;DR: quantization schemes specifically for large language models
Abstract: Transformer-based LLMs have shown impressive capabilities across various Natural Language Processing (NLP) tasks (Chen et al. [2023], Wang et al. [2023], Thirunavukarasu et al. [2023]). Such models, however, carry significant computational overhead due to their memory size. For example, the Llama models (Touvron et al. [2023]) range from 8 billion up to 405 billion parameters, with future models expected to be even larger (Kaplan et al. [2020]). With weights stored as 16-bit floating-point values, such models require 16 GB up to 810 GB of memory. Additionally, floating-point arithmetic operations require significantly more computational power than integer arithmetic. Therefore, weight quantization methods have been proposed (Deng et al. [2020]) for hardware deployment, which quantize the model weights down to, for example, 8-bit integers or, in the most extreme case, 1-bit weights. With quantization, the memory footprint of the LLM is smaller and the computational overhead is reduced, which can have a significant impact on both speed and energy costs for very large models. However, this comes at the cost of a drop in model inference accuracy. Furthermore, additional scaling parameters S and Z (elaborated in Section 2) have to be tracked in order to prevent the datatype sizes from exploding. Currently, there are several popular quantization schemes, namely Post Training Dynamic Quantization (PTDQ), Post Training Static Quantization (PTSQ), and Quantization Aware Training (QAT).
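As a rough illustration of the scale S and zero-point Z mentioned above, the following minimal Python sketch implements standard per-tensor affine (asymmetric) 8-bit quantization. The function names and the per-tensor min/max calibration are illustrative assumptions, not necessarily the exact schemes studied in the submission.

```python
# Minimal sketch of per-tensor affine 8-bit quantization (illustrative, not the
# submission's exact method): q = round(w / S) + Z, with dequantization S * (q - Z).
import numpy as np

def quantize(w: np.ndarray, num_bits: int = 8):
    """Map float weights to unsigned integers using a scale S and zero-point Z."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    S = max((w_max - w_min) / (qmax - qmin), 1e-12)  # scale; guard against a constant tensor
    Z = int(round(qmin - w_min / S))                 # zero-point
    q = np.clip(np.round(w / S) + Z, qmin, qmax).astype(np.uint8)
    return q, S, Z

def dequantize(q: np.ndarray, S: float, Z: int) -> np.ndarray:
    """Recover approximate float weights from the integer representation."""
    return S * (q.astype(np.float32) - Z)

# Usage: quantize a random weight tensor and inspect the round-trip error.
w = np.random.randn(4, 4).astype(np.float32)
q, S, Z = quantize(w)
print(np.abs(w - dequantize(q, S, Z)).max())
```

The sketch also makes the storage saving concrete: each weight is stored as one 8-bit integer instead of a 16-bit float, at the cost of keeping S and Z alongside the tensor and of the round-trip error printed above.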
Submission Number: 44