Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth

ICLR 2026 Conference Submission 25339 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Fine-tuning, Mixed Precision, LoRA, Adaptive rank, Multi-objective optimization
TL;DR: We propose QR-Adaptor, a unified, gradient-free strategy that uses a small calibration dataset to jointly search the quantization bit-width and LoRA rank for each layer, directly improving downstream task performance.
Abstract: As large language models (LLMs) scale up, model compression is crucial for their deployment on resource-constrained devices. While methods like QLoRA reduce resource demands by combining parameter quantization with LoRA fine-tuning, their use of uniform precision can limit performance by failing to account for layer-wise variations in parameter sensitivity. Recent advances have explored dynamic mixed-precision quantization and adaptive LoRA ranks, but these strategies are typically optimized in isolation. The synergistic integration of these two dimensions remains an unresolved core challenge. To address this, we introduce **QR-Adaptor**, a unified, gradient-free framework that jointly optimizes the per-layer quantization bit-width and LoRA rank. Instead of indirectly minimizing quantization error, QR-Adaptor formulates the task as a discrete, multi-objective optimization problem, directly guided by downstream task performance and memory constraints using a small calibration dataset. Our extensive experiments show that QR-Adaptor consistently establishes a new Pareto frontier, outperforming state-of-the-art quantized fine-tuning methods. Notably, our approach can surpass the performance of a 16-bit LoRA fine-tuned model while operating with a memory footprint comparable to 4-bit models.
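The following is a minimal sketch (not the authors' released code) of the kind of gradient-free, multi-objective search the abstract describes: each layer is assigned a (bit-width, LoRA rank) pair, candidate configurations are scored by calibration-set performance and memory, and only Pareto-optimal ones are kept. The functions `eval_calibration_accuracy` and `estimate_memory_gb` are hypothetical placeholders standing in for real quantized fine-tuning and evaluation.

```python
# Sketch of a gradient-free joint search over per-layer (bit-width, LoRA rank),
# scored by a calibration-performance proxy and an estimated memory footprint.
import random

NUM_LAYERS = 32
BIT_CHOICES = [2, 3, 4, 8]
RANK_CHOICES = [4, 8, 16, 32]

def eval_calibration_accuracy(config):
    # Placeholder: in practice, quantize and fine-tune the model under
    # `config`, then score it on a small calibration dataset.
    return sum(b * 0.01 + r * 0.002 for b, r in config) / len(config) + random.gauss(0, 0.01)

def estimate_memory_gb(config):
    # Placeholder memory model: quantized weights plus 16-bit LoRA adapters.
    return sum(b / 8 + r * 0.004 for b, r in config)

def random_config():
    # One (bit-width, rank) pair per layer.
    return [(random.choice(BIT_CHOICES), random.choice(RANK_CHOICES))
            for _ in range(NUM_LAYERS)]

def pareto_front(points):
    # Keep candidates not dominated in (higher accuracy, lower memory).
    return [(acc, mem, cfg) for acc, mem, cfg in points
            if not any(a >= acc and m <= mem and (a > acc or m < mem)
                       for a, m, _ in points)]

candidates = [random_config() for _ in range(64)]
scored = [(eval_calibration_accuracy(c), estimate_memory_gb(c), c) for c in candidates]
for acc, mem, _ in pareto_front(scored):
    print(f"accuracy proxy: {acc:.3f}, estimated memory: {mem:.1f} GB")
```

In the paper's setting, the accuracy proxy would be replaced by actual downstream-task evaluation on the calibration data, which is what lets the search optimize task performance directly rather than quantization error.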
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 25339