From Weight-Based to State-Based Fine-Tuning: Further Memory Reduction on LoRA with Parallel Control

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
TL;DR: We further reduce the memory cost of LoRA methods by introducing parallel control, which enables training 7B/8B models on consumer GPUs such as the Nvidia 3090 or 4090.
Abstract: The LoRA method has achieved notable success in reducing GPU memory usage by applying low-rank updates to weight matrices. Yet, one simple question remains: can we push this reduction even further? Furthermore, is it possible to achieve this while improving performance and reducing computation time? Answering these questions requires moving beyond the conventional weight-centric approach. In this paper, we present a state-based fine-tuning framework that shifts the focus from weight adaptation to optimizing forward states, with LoRA emerging as a special case. Specifically, state-based tuning introduces parameterized perturbations to the states within the computational graph, allowing us to control states across an entire residual block. A key advantage of this approach is the potential to avoid storing large intermediate states in models like transformers. Empirical results across multiple architectures—including ViT, RoBERTa, LLaMA2-7B, and LLaMA3-8B—show that our method further reduces memory consumption and computation time while simultaneously improving performance. Moreover, as a result of this memory reduction, we explore the feasibility of training 7B/8B models on consumer-level GPUs such as the Nvidia 3090, without model quantization. The code is available in an anonymous GitHub repository.
Lay Summary: Are LoRA methods only about low-rank updates to model weights? We challenge this foundational view in this paper. Inspired by ideas from control theory—where systems are often adjusted by changing their internal states rather than their structure—we propose a new approach. Instead of updating the model’s weights, we update the information (or “states”) flowing through the model as it makes predictions. You can think of this like rerouting the flow of water through a system of pipes, rather than replacing the pipes themselves. This method treats the model as a computational graph and introduces controlled tweaks (called "parameterized perturbations") to how the data flows through this graph. Our approach includes LoRA as a special case, but goes further by allowing us to control larger parts of the model, like entire residual blocks. One major benefit is improved memory efficiency. Since we don’t need to store as many large intermediate values during training, we can now train big models (like those with 7 to 8 billion parameters) on consumer-grade GPUs such as the Nvidia 3090—without quantization.
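A minimal, hedged sketch of the idea described above: instead of updating a block's weights, a small trainable low-rank branch perturbs the block's output state while the backbone stays frozen, so gradients never need the frozen block's intermediate activations. All names here (ParallelControl, ControlledBlock, the rank, the zero-initialization) are illustrative assumptions, not the paper's released implementation; the authors' actual parameterization may differ.

```python
# Sketch only: one possible realization of "parallel control" on a residual block.
import torch
import torch.nn as nn

class ParallelControl(nn.Module):
    """Low-rank state perturbation u(x) = B(A x) added to a block's output."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(dim, rank, bias=False)
        self.B = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a zero perturbation of the state

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))

class ControlledBlock(nn.Module):
    """Wraps a frozen block and adds the control branch in parallel."""
    def __init__(self, block: nn.Module, dim: int, rank: int = 8):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)  # the backbone weights are never updated
        self.control = ParallelControl(dim, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The frozen block runs without building a backward graph, so its large
        # intermediate activations are not stored; gradients flow only through
        # the small control branch that perturbs the output state.
        with torch.no_grad():
            y = self.block(x)
        return y + self.control(x)

if __name__ == "__main__":
    dim = 64
    backbone = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    layer = ControlledBlock(backbone, dim, rank=4)
    out = layer(torch.randn(2, 10, dim))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(out.shape, trainable)  # only the low-rank control parameters are trainable
```

With zero-initialized B, the wrapped block initially reproduces the frozen model exactly, mirroring how LoRA starts from the pretrained behavior; the memory saving in this sketch comes from not retaining the frozen block's activations for backpropagation.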
Primary Area: Deep Learning->Algorithms
Keywords: Low-Rank Adaptation, Parallel Control, Parameter-Efficient Fine-Tuning
Submission Number: 643