LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

21 Sept 2023 (modified: 11 Feb 2024) | Submitted to ICLR 2024
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Low-rank adaptation, Memory-efficient, Large language models, Fine-tuning.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a memory-efficient fine-tuning method that reduces the activation memory without performance degradation or expensive recomputation.
Abstract: The low-rank adaptation (LoRA) method can greatly reduce the number of trainable parameters for fine-tuning large language models (LLMs), and it has become a very common technique for LLM fine-tuning. However, fine-tuning with LoRA still requires expensive activation memory to update the low-rank weights. Although existing studies attempt to reduce the storage of activations, they either sacrifice model performance or require much longer fine-tuning time. To this end, we propose a memory-efficient fine-tuning method, named LoRA-FA, that significantly reduces the activation memory without performance degradation or extra computational cost. Specifically, LoRA-FA freezes the projection-down weight $A$ and updates the projection-up weight $B$ in each LoRA layer. This keeps the change of the model weights in a low-rank space, as in LoRA, to preserve fine-tuning performance, while eliminating the need to store full-rank input activations and thus reducing the overall memory consumption. We conduct extensive experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales. Our results show that LoRA-FA consistently preserves fine-tuning accuracy across different tasks and reduces the overall memory cost by up to 4$\times$ and 1.4$\times$ compared to full-parameter fine-tuning and LoRA, respectively. Furthermore, LoRA-FA is compatible with other advanced memory optimization methods such as FlashAttention, QLoRA, and ZeRO.
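To illustrate the idea described in the abstract, below is a minimal sketch of a LoRA-FA-style linear layer in PyTorch. This is not the authors' implementation; the class name `LoRAFALinear` and the hyperparameters `rank` and `alpha` are illustrative assumptions. It only captures the core mechanism stated above: the projection-down weight $A$ is frozen and only the projection-up weight $B$ is trained, so the backward pass for the trainable parameters needs only the rank-$r$ activation rather than the full-rank input activation.

```python
# Illustrative sketch of a LoRA-FA-style layer (assumed names, not the paper's code).
import torch
import torch.nn as nn


class LoRAFALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight W is frozen, as in LoRA

        in_dim, out_dim = base.in_features, base.out_features
        self.scaling = alpha / rank

        # Projection-down weight A: initialized once and FROZEN (the "FA" part).
        self.lora_A = nn.Parameter(torch.randn(rank, in_dim) * 0.01,
                                   requires_grad=False)
        # Projection-up weight B: initialized to zero and TRAINED.
        self.lora_B = nn.Parameter(torch.zeros(out_dim, rank),
                                   requires_grad=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the low-rank activation (x @ A^T, dimension r) is needed to
        # compute the gradient of B, so the full-rank input x does not have
        # to be stored for the LoRA branch.
        low_rank_act = x @ self.lora_A.t()
        return self.base(x) + self.scaling * (low_rank_act @ self.lora_B.t())
```

In this sketch, the frozen branch `x @ A^T` produces an activation of size $r$ per token instead of the hidden dimension, which is the source of the activation-memory saving the abstract refers to; actual savings in the paper are measured end to end across RoBERTa, T5, and LLaMA.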
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3536