Abstract: Large Language Models (LLMs) are often quantized to lower precision to reduce memory cost and latency during inference. However, quantization often degrades model performance, so fine-tuning is required for various downstream tasks. Traditional fine-tuning methods such as stochastic gradient descent and Adam require backpropagation, which is error-prone in low-precision settings. To overcome these limitations, we propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision (e.g., 4- or 8-bit) forward passes. Our method avoids the low-precision straight-through estimator, which requires backward computation, and instead utilizes optimized stochastic rounding to mitigate the increased bias. QuZO simplifies the training process while achieving results comparable to first-order methods in ${\rm FP}8$ training and superior accuracy in ${\rm INT}8$ and ${\rm INT}4$ training. Experiments demonstrate that QuZO achieves competitive performance on classification, multiple-choice, and generation tasks under low-bit training, including zero-shot reasoning tasks. Notably, QuZO incurs minimal overhead and reduces memory consumption by $2.94 \times$–$5.47 \times$ compared to quantized first-order methods during LLaMA-7B fine-tuning.
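The two ingredients the abstract describes, a backpropagation-free (zeroth-order) gradient estimate obtained from perturbed forward passes and stochastic rounding for low-bit quantization, can be illustrated with the minimal sketch below. This is an assumed illustration, not the paper's implementation: the names quantize_sr, zo_step, and loss_fn, the bit width, and the SPSA-style two-point estimator are all assumptions introduced here.

# Minimal sketch (assumed, not the paper's code) of a zeroth-order update
# that uses only forward passes plus uniform quantization with stochastic
# rounding. `params` are plain tensors (requires_grad=False); `loss_fn` is
# assumed to run the model's low-precision forward pass and return a scalar.
import torch


def quantize_sr(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Uniform symmetric quantization with stochastic rounding."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    scaled = x / scale
    low = scaled.floor()
    # Round up with probability equal to the fractional part, so the
    # rounding error is zero in expectation (unbiased).
    q = low + (torch.rand_like(scaled) < (scaled - low)).float()
    return q.clamp(-qmax - 1, qmax) * scale


@torch.no_grad()
def zo_step(params, loss_fn, lr: float = 1e-4, eps: float = 1e-3):
    """One SPSA-style zeroth-order step: two forward passes, no backprop."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale: float):
        torch.manual_seed(seed)  # regenerate the same noise z each time
        for p in params:
            p.add_(scale * eps * torch.randn_like(p))

    perturb(+1.0)
    loss_plus = loss_fn()        # forward pass at theta + eps * z
    perturb(-2.0)
    loss_minus = loss_fn()       # forward pass at theta - eps * z
    perturb(+1.0)                # restore theta

    grad_scale = (loss_plus - loss_minus) / (2 * eps)

    torch.manual_seed(seed)      # replay the same z for the update
    for p in params:
        p.sub_(lr * grad_scale * torch.randn_like(p))
    for p in params:             # keep the weights on the low-bit grid
        p.copy_(quantize_sr(p))

Because only forward passes are needed, no activation storage or low-precision backward pass is required, which is the source of the memory savings the abstract reports.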
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Quantization, PEFT method
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: English
Submission Number: 4475