Keywords: Large Language Models (LLMs), Reinforcement Learning from Human Feedback (RLHF), Mixture-of-Experts (MoE), Parameter-Efficient Fine-Tuning (PEFT), Group Relative Policy Optimization (GRPO)
TL;DR: We introduce RO-GRPO, a method that prevents routing collapse in MoE models during GRPO by transforming internal routing statistics into a reward signal, enabling the simultaneous alignment of a model's behavior and its internal mechanisms.
Abstract: Parameter-efficient Mixture-of-Experts (MoE) architectures, such as LoRA-MoE, enable strong and generalizable fine-tuning. However, a critical problem arises when fine-tuning these architectures with advanced reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO). Traditional supervised techniques, such as auxiliary losses, are incompatible with the GRPO process, while the external task reward is blind to the internal routing mechanism. This disconnect leads to routing collapse and severe underutilization of MoE adapter parameters. To resolve it, we introduce Routing-Optimized Group Relative Policy Optimization (RO-GRPO), a mechanism-aware framework that turns internal expert routing statistics collected during training into a direct reward signal, seamlessly integrating routing supervision into the reinforcement fine-tuning (RFT) process. This enables effective optimization of parameter utilization and improves performance on both unimodal and multimodal mathematical reasoning tasks, all without extra training stages. Our work provides the first demonstration that a scalar reward in GRPO can be engineered from a model's own internal mechanisms to explicitly guide their optimization, extending alignment from mere behavior tuning to holistic mechanism alignment.
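The abstract does not specify how the routing statistics are converted into a reward, so the sketch below is only a minimal, hypothetical illustration of the general idea: per-rollout expert-usage counts are scored (here by normalized entropy, so that balanced routing scores high and routing collapse scores low) and the result is mixed into the scalar reward before GRPO's group-relative normalization. The names `routing_balance_reward`, `ro_grpo_advantages`, and the weight `lam` are assumptions, not the paper's formulation.

```python
import torch

def routing_balance_reward(expert_counts: torch.Tensor) -> torch.Tensor:
    # Hypothetical routing reward: normalized entropy of the expert-usage
    # distribution for one rollout. Returns ~1.0 when tokens are spread
    # uniformly across experts and ~0.0 when a single expert receives
    # everything (routing collapse).
    probs = expert_counts.float() / expert_counts.float().sum().clamp(min=1.0)
    entropy = -(probs * (probs + 1e-9).log()).sum()
    return entropy / torch.log(torch.tensor(float(expert_counts.numel())))

def ro_grpo_advantages(task_rewards: torch.Tensor,
                       routing_rewards: torch.Tensor,
                       lam: float = 0.1) -> torch.Tensor:
    # Combine the external task reward with the routing-derived reward,
    # then apply standard GRPO group-relative normalization over the
    # G rollouts sampled for a single prompt.
    total = task_rewards + lam * routing_rewards          # shape (G,)
    return (total - total.mean()) / (total.std() + 1e-8)

# Example usage (toy numbers): 4 rollouts for one prompt, 4 experts each.
counts = [torch.tensor([12, 3, 0, 9]), torch.tensor([6, 6, 6, 6]),
          torch.tensor([24, 0, 0, 0]), torch.tensor([10, 5, 4, 5])]
r_route = torch.stack([routing_balance_reward(c) for c in counts])
r_task = torch.tensor([1.0, 0.0, 1.0, 0.0])  # e.g., answer correctness
advantages = ro_grpo_advantages(r_task, r_route)
```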
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14745