Keywords: PEFT, LoRA, MoE, LLM
Abstract: Low-Rank Adaptation (LoRA) has become the dominant paradigm for Parameter-Efficient Fine-Tuning (PEFT) of large language models. However, most prior work focuses on dense architectures. In contrast, Mixture-of-Experts (MoE) models—now a de facto standard—scale parameter counts while keeping per-token compute nearly constant, creating new challenges for LoRA: how to minimize trainable parameters and maximize fine-tuning throughput without sacrificing quality. We propose Hot-Experts Layer-level Low-Rank Adaptation ($\textbf{HELLoRA}$), a simple yet effective scheme that attaches LoRA modules only to the hot experts at each layer, i.e., those most frequently activated. This design sharply reduces the number of trainable parameters and boosts fine-tuning throughput, while, perhaps unexpectedly, improving downstream performance. To stress-test HELLoRA under extreme parameter budgets, we further introduce $\textbf{HELLoRI}$, an orthogonal composition of HELLoRA with the recent LoRA with Reduced Interference (LoRI). Across extensive experiments on code generation, mathematical reasoning, and safety alignment, HELLoRA consistently outperforms strong PEFT baselines. In particular, relative to vanilla LoRA, HELLoRA uses only $\textbf{15.74}$\% of LoRA's trainable parameters, improves accuracy by $\textbf{9.24}$\%, and achieves an $\textbf{88.80}$\% speedup. HELLoRI matches LoRA's accuracy while training just $\textbf{0.7}$\% of LoRA's parameters. These results suggest that focusing LoRA capacity on hot experts is a practical path to scaling PEFT for large MoE LLMs.
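For concreteness, below is a minimal PyTorch-style sketch of the mechanism the abstract describes: count how often each expert is routed to on a small calibration set, then wrap only each layer's most frequently activated ("hot") experts with LoRA adapters. The `LoRALinear` wrapper, the `layer.experts` / `expert.ffn` attribute names, and the `hot_ratio` threshold are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the hot-expert idea from the abstract: attach LoRA
# modules only to the experts the router selects most often. Attribute names
# (`layer.experts`, `expert.ffn`) and `hot_ratio` are assumptions about the
# MoE implementation, not the authors' actual code.

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank (A, B) update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base expert weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.02)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


def count_expert_activations(routed_ids, num_layers, num_experts):
    """Tally routing frequency per (layer, expert) on a calibration set.

    `routed_ids` is a list of (layer_idx, LongTensor of selected expert ids);
    collecting these tensors is specific to the MoE implementation.
    """
    counts = torch.zeros(num_layers, num_experts)
    for layer_idx, ids in routed_ids:
        counts[layer_idx] += torch.bincount(ids.flatten(), minlength=num_experts).float()
    return counts


def attach_lora_to_hot_experts(moe_layers, counts, hot_ratio=0.25, r=8):
    """Attach LoRA only to the top `hot_ratio` fraction of experts per layer."""
    for layer_idx, layer in enumerate(moe_layers):
        k = max(1, int(hot_ratio * len(layer.experts)))
        hot_ids = torch.topk(counts[layer_idx], k).indices.tolist()
        for eid in hot_ids:
            expert = layer.experts[eid]
            expert.ffn = LoRALinear(expert.ffn, r=r)  # cold experts stay frozen and adapter-free
```

As in standard LoRA, zero-initializing B keeps each wrapped expert's initial output unchanged, so fine-tuning starts from the pretrained behavior.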
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2851