LoRA Learns Less and Forgets Less

Published: 13 Aug 2024, Last Modified: 13 Aug 2024. Accepted by TMLR. License: CC BY-SA 4.0
Abstract: Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\approx$100K prompt-response pairs) and continued pretraining ($\approx$20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100$\times$ greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
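
For readers unfamiliar with the mechanism the abstract refers to, the sketch below illustrates a low-rank perturbation: a frozen weight matrix plus a trainable rank-$r$ update $BA$, scaled by $\alpha/r$. This is a minimal PyTorch illustration under assumed hyperparameters ($r$, $\alpha$ are arbitrary examples), not the authors' implementation; see the linked code repository for the actual experimental setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: frozen base linear layer plus a trainable
    low-rank update B @ A, scaled by alpha / r (the alpha = 2r convention
    mentioned in the revision notes gives a fixed scale of 2)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are trained
        # A gets a small random init, B starts at zero so the perturbation
        # is initially the identity map on top of the frozen base weights.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

In a typical setup, modules like the attention projections of each transformer block are wrapped this way and only `A` and `B` are updated, which is what "standard low-rank settings" refers to in the abstract.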
Certifications: Featured Certification
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Addressed all editor and reviewer comments and integrated new results. The most notable additions include:
- New Tülu-v2-mix experiment with associated evaluation metrics, including MT-bench.
- Reran _all_ LoRA experiments (4 datasets) with equal hyperparameters, including $\alpha=2r$, enabling more overlap between full finetuning and LoRA and a clean comparison of math and code.
- Added a new rank $r=64$; created cleaner learning-forgetting Pareto plots that visualize training duration.
- Added a supplementary figure of $k$ versus pass@$k$, up to $k=256$ (a sketch of the estimator appears after this list).
- Added detailed training and evaluation protocols.
- Linked to a repository where we are uploading the model checkpoints.
- Analysis of the effects of LoRA $\alpha$.
- New analysis of LoRA throughput and memory compared to full finetuning.
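
As a point of reference for the pass@$k$ figure mentioned above, the snippet below computes the unbiased pass@$k$ estimator commonly used for HumanEval-style code evaluation (probability that at least one of $k$ samples drawn from $n$ generations, of which $c$ are correct, passes). It is a hedged illustration of the metric, not a claim about the authors' exact evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations with c correct,
    passes the tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 256 samples per problem, 40 of them correct.
print(pass_at_k(n=256, c=40, k=1))    # 0.15625
print(pass_at_k(n=256, c=40, k=256))  # 1.0
```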
Code: https://github.com/danbider/lora-tradeoffs
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 2693