How Low Can LoRA Go: System-Level Throughput, Energy, and Model Quality Tradeoffs when Fine-Tuning Adapters
Presentation: In-Person
Keywords: Low Rank Adaptation; Energy; Performance Modeling; Fine-tuning; Parameter Efficient Fine-Tuning; LLM; Extractive Question Answering
Presenter Full Name: Connor Espenshade
TL;DR: An investigation of LoRA adapter rank versus system and model performance, finding that lower ranks deliver equal model quality at a fraction of the memory and energy
Presenter Email: cje2136@columbia.edu
Abstract: As models scale beyond trillions of parameters, extending their functionality is increasingly achieved through fine-tuning existing base models. However, fine-tuning all parameters remains computationally expensive. Recent techniques such as Low-Rank Adaptation (LoRA) have been developed to reduce the number of trainable parameters. LoRA adapters have gained widespread adoption, but their effects on GPU system metrics, such as throughput and energy efficiency, are not yet well understood.
In this study, we examine these system-level metrics as a function of the LoRA adapter rank. Our findings show that reducing the rank of LoRA adapters does not lead to a significant drop in model quality, while simultaneously improving throughput, energy efficiency, and memory usage by up to 2.7x. Further, we find that the presence of a LoRA adapter, rather than its rank, is what greatly improves model quality compared to zero-shot inference with the base model. This makes smaller LoRA adapters a compelling choice from both a system and a model quality perspective.
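
For context, the sketch below illustrates the kind of rank sweep the abstract describes, using the HuggingFace PEFT library. It is a minimal illustration, not the authors' exact configuration: the base checkpoint, target modules, alpha heuristic, dropout, and rank values are assumptions. Lowering r directly shrinks the adapter's trainable-parameter count, which is the knob behind the throughput, energy, and memory results.

from transformers import AutoModelForQuestionAnswering
from peft import LoraConfig, TaskType, get_peft_model

for rank in (1, 4, 16, 64):  # hypothetical rank sweep
    # Reload a fresh base model each iteration so adapter layers do not stack.
    base = AutoModelForQuestionAnswering.from_pretrained("roberta-base")
    config = LoraConfig(
        task_type=TaskType.QUESTION_ANS,
        r=rank,                             # adapter rank: the variable under study
        lora_alpha=2 * rank,                # common scaling heuristic (assumption)
        target_modules=["query", "value"],  # attention projections (assumption)
        lora_dropout=0.05,
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()      # trainable-parameter count shrinks with rank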
Presenter Bio: Connor Espenshade is a Computer Engineering senior at Columbia University. He has six publications spanning AI, biology, and astrophysics. His computer architecture research focuses on analyzing and optimizing system performance for time and energy.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link: https://youtu.be/KOYBOU8VOpE
YouTube Link Poster: N/A
Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.
Google Slides: https://docs.google.com/presentation/d/1nckIF2JfDSVDkLO_xIQHUdPbenKqApk7qJPfUZE7vhU/edit?usp=sharing
Poster: Yes
Workshop Registration: Yes, the presenter has registered for the workshop.
YouTube Link Short: [to come]
Submission Number: 20