Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization—mapping, threshold selection, and precision assignment—while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance–precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
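To make the fine-grained quantization step concrete, below is a minimal sketch, not the LowRA implementation, of a codebook-based quantizer for one slice of weights: the mapping is a small set of levels, threshold selection takes midpoints between adjacent levels, and dequantization is a table lookup. The specific codebook values and the midpoint rule are illustrative assumptions.

```python
# Illustrative sketch only: a per-slice codebook quantizer with an explicit
# mapping (levels) and threshold selection (midpoints between levels).
# The codebook below is an assumption, not LowRA's learned mapping.
import torch

def quantize_slice(w: torch.Tensor, codebook: torch.Tensor):
    """Map each weight to the index of its nearest codebook level."""
    levels, _ = torch.sort(codebook)
    thresholds = (levels[:-1] + levels[1:]) / 2      # decision boundaries
    indices = torch.bucketize(w, thresholds)         # values in 0 .. len(levels)-1
    return indices.to(torch.uint8), levels

def dequantize_slice(indices: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Recover an approximate weight by looking up each stored index."""
    return levels[indices.long()]

# Example: a 2-bit (4-level) codebook applied to one slice of weights.
w = torch.randn(256)
codebook = torch.tensor([-1.0, -0.3, 0.3, 1.0])      # assumed learned levels
idx, levels = quantize_slice(w, codebook)
w_hat = dequantize_slice(idx, levels)
print(f"mean abs error: {(w - w_hat).abs().mean():.3f}")
```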
Large language models (LLMs) pack hundreds of billions of parameters, so even “lightweight” frameworks for adapting LLMs to downstream tasks (e.g., LoRA or QLoRA) still strain GPU memory. LowRA squeezes each parameter to about 2 bits (over 15× smaller than the 32-bit norm) while keeping accuracy nearly intact. It learns quantization encoders/decoders specific to each slice of parameters, assigns 1-/2-/4-bit budgets with a fast optimizer, and dequantizes on the fly with lightweight CUDA kernels, so there’s virtually no runtime cost. Across four mainstream LLMs and four benchmark datasets, LowRA beats existing quantizers above 2 bits and still works down to 1.15 bits, cutting memory by up to 50 percent. This unlocks personalized fine-tuning on laptops, phones, and other edge devices that previously couldn’t handle such large models.
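As a rough illustration of the precision-assignment idea, the following greedy routine spends an average bit budget by giving the most sensitive weight slices 4 bits, the next most sensitive 2 bits, and the rest 1 bit. This is a sketch under assumptions, not LowRA's fast optimizer, and the per-slice sensitivity scores are a stand-in for whatever error signal the real assignment uses.

```python
# Illustrative sketch only: greedy 1-/2-/4-bit assignment under an average
# bit budget. The per-slice "sensitivity" scores are assumed inputs (e.g., a
# quantization-error estimate), not LowRA's actual objective.
import torch

def assign_precisions(sensitivity: torch.Tensor, avg_bits: float = 2.0) -> torch.Tensor:
    n = sensitivity.numel()
    bits = torch.ones(n, dtype=torch.int64)           # every slice starts at 1 bit
    budget = int(round(avg_bits * n)) - n             # extra bits beyond 1 bit per slice
    for i in torch.argsort(sensitivity, descending=True):
        if budget >= 3:                               # upgrade 1 -> 4 bits (costs 3)
            bits[i], budget = 4, budget - 3
        elif budget >= 1:                             # upgrade 1 -> 2 bits (costs 1)
            bits[i], budget = 2, budget - 1
    return bits

# Example: 8 slices with a 2-bit average budget yields a 4-/2-/1-bit mix.
sens = torch.tensor([0.9, 0.1, 0.5, 0.05, 0.7, 0.2, 0.4, 0.3])
print(assign_precisions(sens, avg_bits=2.0))
```

In the system described above, such an assignment is paired with per-slice encoders/decoders and on-the-fly dequantization in CUDA kernels; the sketch only shows how a mixed 1-/2-/4-bit layout can hit a target average precision such as 2 bits.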