Keywords: LoRA, low-rank adaptation, LLM, parameter-efficient fine-tuning
Abstract: Fine-tuning Large Language Models (LLMs) via Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach for adapting general-purpose models to specialized downstream tasks. Among PEFT methods, Low-Rank Adaptation (LoRA) is widely adopted for its ability to reduce the number of trainable parameters through a low-rank decomposition of weight updates. However, as demand grows for deploying thousands of task-specific adapters on resource-constrained edge devices, the storage and memory overhead of standard LoRA remains a critical bottleneck. In this study, we propose Relayed LoRA, a novel algorithm designed to push the boundaries of parameter efficiency. Our approach introduces a secondary, "relayed" decomposition of the two standard LoRA matrices ($A$ and $B$) into a quad-matrix structure ($A_1, A_2, B_1, B_2$). By leveraging a fixed structural mapping, we decouple the representational rank from the total number of trainable parameters, enabling the system to maintain a high-rank update while significantly reducing the parameter footprint, often by an order of magnitude. Empirical evaluations on the GLUE benchmark demonstrate that Relayed LoRA achieves performance comparable to standard LoRA with substantially fewer trainable parameters. Our results suggest that Relayed LoRA provides a scalable and efficient framework for large-scale multi-task deployment, offering a new paradigm for extreme parameter-efficient fine-tuning in memory-sensitive environments.
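To make the parameter-count claim concrete, the arithmetic below sketches how factoring each LoRA matrix through a smaller inner dimension shrinks the footprint. This is an illustrative sketch only: the abstract does not detail the fixed structural mapping, so the inner dimension `s` and the plain product factorization used here are assumptions, not the paper's actual construction (a naive product factorization would also cap the update rank at `s`, which is precisely what the paper's fixed, non-trainable mapping is said to avoid).

```python
# Hypothetical sizes: hidden width d, LoRA rank r, inner dimension s.
# These values are illustrative, not taken from the paper.
d, r, s = 768, 16, 4

# Standard LoRA: delta_W = B @ A, with A of shape (r, d) and B of shape (d, r).
lora_params = r * d + d * r

# Quad-matrix sketch: A ~ A2 @ A1 and B ~ B2 @ B1, with
# A1: (s, d), A2: (r, s), B1: (s, r), B2: (d, s).
relayed_params = (s * d + r * s) + (s * r + d * s)

print(f"standard LoRA: {lora_params} params")   # 24576
print(f"relayed sketch: {relayed_params} params")  # 6272
print(f"reduction: {lora_params / relayed_params:.1f}x")
```

With these toy sizes the trainable-parameter count drops roughly 4x; the order-of-magnitude reductions claimed in the abstract would correspond to smaller inner dimensions or shared fixed factors.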
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM fine-tuning, low-resource training for NLP
Contribution Types: Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 922