Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization—mapping, threshold selection, and precision assignment—while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance–precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
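To make the fine-grained quantization step concrete, below is a minimal sketch, not the LowRA implementation, of a codebook-based quantizer for one slice of weights: the mapping is a small set of levels, threshold selection takes midpoints between adjacent levels, and dequantization is a table lookup. The specific codebook values and the midpoint rule are illustrative assumptions.

```python
# Illustrative sketch only: a per-slice codebook quantizer with an explicit
# mapping (levels) and threshold selection (midpoints between levels).
# The codebook below is an assumption, not LowRA's learned mapping.
import torch

def quantize_slice(w: torch.Tensor, codebook: torch.Tensor):
    """Map each weight to the index of its nearest codebook level."""
    levels, _ = torch.sort(codebook)
    thresholds = (levels[:-1] + levels[1:]) / 2      # decision boundaries
    indices = torch.bucketize(w, thresholds)         # values in 0 .. len(levels)-1
    return indices.to(torch.uint8), levels

def dequantize_slice(indices: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Recover an approximate weight by looking up each stored index."""
    return levels[indices.long()]

# Example: a 2-bit (4-level) codebook applied to one slice of weights.
w = torch.randn(256)
codebook = torch.tensor([-1.0, -0.3, 0.3, 1.0])      # assumed learned levels
idx, levels = quantize_slice(w, codebook)
w_hat = dequantize_slice(idx, levels)
print(f"mean abs error: {(w - w_hat).abs().mean():.3f}")
```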
Large language models (LLMs) pack hundreds of billions of parameters, so even “lightweight” frameworks for adapting LLMs to downstream tasks (e.g., LoRA or QLoRA) still strain GPU memory. LowRA squeezes each parameter to about 2 bits (over 15× smaller than the 32-bit norm) while keeping accuracy nearly intact. It learns quantization encoders/decoders specific to each slice of parameters, assigns 1-/2-/4-bit budgets with a fast optimizer, and dequantizes on the fly with lightweight CUDA kernels, so there’s virtually no runtime cost. Across four mainstream LLMs and four benchmark datasets, LowRA beats existing quantizers above 2 bits and still works down to 1.15 bits, cutting memory by up to 50 percent. This unlocks personalized fine-tuning on laptops, phones, and other edge devices that previously couldn’t handle such large models.
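As a rough illustration of the precision-assignment idea, the following greedy routine spends an average bit budget by giving the most sensitive weight slices 4 bits, the next most sensitive 2 bits, and the rest 1 bit. This is a sketch under assumptions, not LowRA's fast optimizer, and the per-slice sensitivity scores are a stand-in for whatever error signal the real assignment uses.

```python
# Illustrative sketch only: greedy 1-/2-/4-bit assignment under an average
# bit budget. The per-slice "sensitivity" scores are assumed inputs (e.g., a
# quantization-error estimate), not LowRA's actual objective.
import torch

def assign_precisions(sensitivity: torch.Tensor, avg_bits: float = 2.0) -> torch.Tensor:
    n = sensitivity.numel()
    bits = torch.ones(n, dtype=torch.int64)           # every slice starts at 1 bit
    budget = int(round(avg_bits * n)) - n             # extra bits beyond 1 bit per slice
    for i in torch.argsort(sensitivity, descending=True):
        if budget >= 3:                               # upgrade 1 -> 4 bits (costs 3)
            bits[i], budget = 4, budget - 3
        elif budget >= 1:                             # upgrade 1 -> 2 bits (costs 1)
            bits[i], budget = 2, budget - 1
    return bits

# Example: 8 slices with a 2-bit average budget yields a 4-/2-/1-bit mix.
sens = torch.tensor([0.9, 0.1, 0.5, 0.05, 0.7, 0.2, 0.4, 0.3])
print(assign_precisions(sens, avg_bits=2.0))
```

In the system described above, such an assignment is paired with per-slice encoders/decoders and on-the-fly dequantization in CUDA kernels; the sketch only shows how a mixed 1-/2-/4-bit layout can hit a target average precision such as 2 bits.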