[Proposal-ML] An Empirical Analysis on Quantization Schemes for Large Language Models

30 Oct 2024 (modified: 05 Nov 2024) · THU 2024 Fall AML Submission · CC BY 4.0
Keywords: quantization, large language model
TL;DR: An empirical comparison of quantization schemes for large language models
Abstract: Transformer-based LLMs have shown impressive capabilities across various Natural Language Processing (NLP) tasks (Chen et al. [2023], Wang et al. [2023], Thirunavukarasu et al. [2023]). Such models, however, incur significant computational overheads due to their sheer memory size. For example, the Llama models (Touvron et al. [2023]) range from 8 billion to 405 billion parameters, with future models expected to be even larger (Kaplan et al. [2020]). Stored as 16-bit floating-point weights, such models require between 16 GB and 810 GB of memory. Additionally, floating-point arithmetic requires significantly more computational power than integer arithmetic. Weight quantization methods (Deng et al. [2020]) have therefore been proposed for hardware deployment; these quantize the model weights down to, for example, 8-bit integers or, in the most extreme case, 1-bit weights. Quantization shrinks the memory footprint of the LLM and reduces its computational overhead, which can significantly lower both latency and energy costs for very large models. However, this comes at the cost of a drop in inference accuracy. Furthermore, additional scaling parameters S and Z (elaborated in Section 2) have to be tracked in order to prevent exploding datatype sizes. Currently, there are several popular quantization schemes, namely Post-Training Dynamic Quantization (PTDQ), Post-Training Static Quantization (PTSQ), and Quantization-Aware Training (QAT).
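As a rough illustration of the scale and zero-point bookkeeping (S, Z) the abstract alludes to, below is a minimal sketch of affine int8 weight quantization in NumPy. The function names, the 8-bit range, and the per-tensor min/max calibration are illustrative assumptions for this sketch, not the implementation of the PTDQ/PTSQ/QAT schemes the proposal will evaluate.

```python
import numpy as np

def affine_quantize(w, num_bits=8):
    """Map float weights to signed integers using a scale S and zero-point Z (illustrative sketch)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    w_min, w_max = float(w.min()), float(w.max())
    S = (w_max - w_min) / (qmax - qmin)      # scale: float range covered per integer step
    Z = int(round(qmin - w_min / S))         # zero-point: integer that corresponds to 0.0
    q = np.clip(np.round(w / S) + Z, qmin, qmax).astype(np.int8)
    return q, S, Z

def affine_dequantize(q, S, Z):
    """Recover approximate float weights from the quantized integers."""
    return (q.astype(np.float32) - Z) * S

# Usage: quantize a random weight matrix and measure the reconstruction error
# introduced by the 8-bit representation.
w = np.random.randn(4, 4).astype(np.float32)
q, S, Z = affine_quantize(w)
w_hat = affine_dequantize(q, S, Z)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Only the int8 tensor plus the two parameters S and Z need to be stored, which is where the memory savings over 16-bit floats come from.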
Submission Number: 44