Low Rank Quantization-Aware Training for LLMs

Published: 21 Jun 2024, Last Modified: 26 Jul 2024 · ES-FoMo-II 2024 Poster · CC BY 4.0
Keywords: transformers, LLM, quantization, quantization-aware training, QAT, low-rank adaptation, PEFT, memory efficiency, inference efficiency
TL;DR: We propose a lightweight and memory-efficient quantization-aware training (QAT) algorithm for LLMs.
Abstract: In this paper we propose LR-QAT, a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing performance: (a) a low-rank quantization-aware reparameterization; (b) a downcasting operation using fixed-point or double-packed integers; and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, incurring no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pre-training framework, meaning that the resulting model can still be used for any downstream task afterwards; and (iii) is orthogonal to most recent PTQ methods and can therefore be seamlessly combined with them. We apply LR-QAT to the LLaMA-1/2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms most recent LLM quantization approaches and matches the performance of full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer-grade GPU with 24GB of memory.
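
To make component (a) concrete, below is a minimal PyTorch sketch of the general idea of a low-rank quantization-aware reparameterization: a frozen pretrained weight is combined with a trainable low-rank term inside the quantizer, and gradients flow through rounding via a straight-through estimator. The class and parameter names (LowRankQATLinear, rank, n_bits) are illustrative assumptions, and the sketch omits the downcasting and checkpointing components; it is not the authors' implementation.

```python
# Hedged sketch: QAT with a trainable low-rank term inside the quantizer.
# The frozen pretrained weight W0 stays fixed; only the scale and the
# low-rank factors A, B (plus the bias) are updated during training.
import torch
import torch.nn as nn


class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class LowRankQATLinear(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int = 16, n_bits: int = 4):
        super().__init__()
        out_f, in_f = linear.weight.shape
        # Frozen pretrained weight; in LR-QAT this copy would additionally be
        # downcast (fixed-point or double-packed integers) to save memory.
        self.register_buffer("w0", linear.weight.detach().clone())
        self.bias = linear.bias
        # Symmetric n-bit quantization grid with a trainable per-tensor scale.
        qmax = 2 ** (n_bits - 1) - 1
        self.qmin, self.qmax = -(qmax + 1), qmax
        self.scale = nn.Parameter(self.w0.abs().max() / qmax)
        # Trainable low-rank factors placed inside the quantizer argument.
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)

    def quantized_weight(self) -> torch.Tensor:
        # Quantize W0/s + A @ B jointly, then clamp to the grid and rescale.
        z = self.w0 / self.scale + self.A @ self.B
        q = torch.clamp(RoundSTE.apply(z), self.qmin, self.qmax)
        return q * self.scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.quantized_weight(), self.bias)


if __name__ == "__main__":
    layer = LowRankQATLinear(nn.Linear(64, 64), rank=8, n_bits=4)
    y = layer(torch.randn(2, 64))
    y.sum().backward()  # W0 receives no gradient; scale, A, B and bias do
    print(y.shape)
```

Because the rounding and clamping are folded into the forward pass, the trained layer can be exported as an ordinary integer-quantized linear layer, which is consistent with the abstract's claim of no additional inference overhead compared to traditional PTQ.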
Submission Number: 36