A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a stronger training strategy for Mixture of LoRAs based on improved Riemannian preconditioners, which boosts the learning procedure and downstream performance.
Abstract: To streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been widely adopted across various fields, including instruction tuning and domain adaptation. The underlying idea of LoRA is to decompose a full-rank update matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates training. Furthermore, to address the limited expressive capacity of a single LoRA, the Mixture-of-Experts (MoE) framework has been introduced to combine multiple LoRA adapters. This integration of LoRA experts yields visible improvements across several downstream scenarios. However, the mixture of LoRAs (MoE-LoRA) still exhibits low robustness during tuning and inference. Inspired by Riemannian preconditioners, which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA that stabilizes and boosts its feature learning via gate-rescaled multi-space projections. We provide both a theoretical solution and an alternative engineering strategy. Experiments with the SGD and AdamW optimizers demonstrate the effectiveness of our methodology. Source code is available at https://github.com/THUDM/MoELoRA_Riemannian.
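To make the idea concrete, here is a minimal sketch of how a gate-rescaled Riemannian preconditioner for a single LoRA expert might look. It applies the standard LoRA preconditioner (right-multiplying the gradient of B by (A Aᵀ)⁻¹ and left-multiplying the gradient of A by (Bᵀ B)⁻¹) and then scales the update by the expert's gate weight; the function name `preconditioned_grads`, the damping term `delta`, and the exact way the gate enters the update are illustrative assumptions, not the paper's precise algorithm.

```python
import torch

def preconditioned_grads(B, A, grad_B, grad_A, gate, delta=1e-6):
    """Riemannian-preconditioned LoRA gradients, rescaled by a gate weight.

    B: (m, r) and A: (r, n) are one expert's LoRA factors; `gate` is the
    scalar routing weight the MoE gate assigns to this expert. Folding the
    gate into the preconditioned update is an illustrative assumption.
    """
    r = A.shape[0]
    eye = torch.eye(r, device=A.device, dtype=A.dtype)
    # Standard LoRA Riemannian preconditioner with a small damping term.
    pre_B = grad_B @ torch.linalg.inv(A @ A.T + delta * eye)
    pre_A = torch.linalg.inv(B.T @ B + delta * eye) @ grad_A
    return gate * pre_B, gate * pre_A

# Toy usage: one expert, one forward/backward pass, one SGD step.
m, n, r = 16, 12, 4
B = torch.randn(m, r, requires_grad=True)
A = torch.randn(r, n, requires_grad=True)
x = torch.randn(n)
gate = 0.7                                # routing weight from the MoE gate
loss = (gate * (B @ A @ x)).pow(2).sum()
loss.backward()
gB, gA = preconditioned_grads(B, A, B.grad, A.grad, gate)
with torch.no_grad():                     # plain SGD step on the rescaled gradients
    B -= 1e-2 * gB
    A -= 1e-2 * gA
```

In a full MoE-LoRA layer, this rescaling would be repeated per expert with each expert's own gate value, which is the sense in which the gate weights shape both routing and optimization.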
Lay Summary: Large AI models often need to be fine-tuned for new tasks. A popular fine-tuning technique called LoRA (Low-Rank Adaptation) makes this more efficient by using a pair of smaller, simpler modules. However, LoRA struggles with complex tasks because of its limited module size, and its training procedure is suboptimal because of its dual-module structure. To address these issues, researchers have proposed integrating multiple LoRA modules (MoE-LoRA) to handle complex tasks, as well as training refiners (e.g., Riemannian preconditioners) that adjust the LoRA training procedure. However, there is no dedicated strategy for refining the training behavior of MoE-LoRA, even though its multi-module structure is naturally less robust and harder to train. Building on Riemannian preconditioners, we introduce a new method that stabilizes and enhances the training of MoE-LoRA by additionally weighting and adjusting how each LoRA module learns. The key idea is that the importance assigned to each LoRA expert should also influence how it updates, and should therefore be integrated into the refiners. Our method helps MoE-LoRA learn more effectively, narrowing the gap between efficient fine-tuning and full fine-tuning. Tests on different AI models and tasks confirm that the approach works well and outperforms the baselines, offering a better way to adapt powerful AI models without excessive computational cost.
Link To Code: https://github.com/THUDM/MoELoRA_Riemannian
Primary Area: Deep Learning->Large Language Models
Keywords: Parameter-Efficient Fine-Tuning, Low-Rank Adaptation, Mixture of Experts, Foundation Models, Gradient Optimization, Riemannian Preconditioner
Submission Number: 8526