# Supplementary Code for "Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards"

> **Note:** This code corresponds to ICLR 2026 Submission ID #14745. It is intended for review purposes only.

This repository contains the official implementation for the paper "Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards". Our work introduces **Routing-Optimized Group Relative Policy Optimization (RO-GRPO)**, a framework that integrates a mechanism-aware reward signal into the reinforcement learning fine-tuning (RFT) process to optimize the internal routing of LoRA-MoE models.

This codebase is built upon PyTorch and leverages the MS-Swift framework, with custom modifications to PEFT to implement the LoRA-MoE architecture.

## Code Structure

The repository is organized as follows:

```
.
├── README.md               # This file: Instructions for setup, training, and evaluation.
├── requirements.txt        # Python dependencies for setting up the environment.
├── train.sh                # Unified script to launch all training experiments.
├── eval.sh                 # Unified script to run unimodal and multimodal evaluations.
├── loramoe_layer.py        # The core implementation of our LoRA-MoE layer, modified from PEFT.
├── ro_grpo_trainer.py      # The RO-GRPO trainer, extending the base trainer with our core logic.
└── routing_reward.py       # Implementation of our two mechanism-aware reward functions.
```

## 1. Installation and Setup

### 1.1. Prerequisites
- Python 3.10+
- NVIDIA GPU with CUDA 11.8+
- We recommend using a virtual environment (e.g., `conda` or `venv`).

### 1.2. Install Dependencies

First, create and activate a virtual environment. Then, install the required packages using the provided `requirements.txt` file.

```bash
# Create and activate a virtual environment
python -m venv ro_grpo_env
source ro_grpo_env/bin/activate

# Install all base dependencies
pip install -r requirements.txt
```
**Note:** The `flash_attn` package may require specific compilation steps depending on your system's CUDA and PyTorch versions. If you encounter issues, please refer to the official FlashAttention repository for installation guidance.

### 1.3. Manual Code Integration (Crucial for Reproducibility)

Our implementation modifies core components of the `peft` and `ms-swift` libraries. To reproduce our results, you must manually integrate our custom code into the installed library files.

**First, locate the installation paths of the libraries:**
```bash
pip show peft
pip show ms-swift
```
This will display the `Location` of each library (e.g., `.../ro_grpo_env/lib/python3.10/site-packages`).
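
The same lookup can be done from Python, which is convenient when the path is needed inside a script. The sketch below is a minimal example using only the standard library; `package_dir` is a hypothetical helper name, and it assumes `ms-swift` is importable as `swift`, as the `site-packages/swift/...` paths in this README suggest.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(import_name: str) -> Optional[Path]:
    """Return the directory of an installed package, or None if it cannot be found."""
    spec = importlib.util.find_spec(import_name)
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or a single-file module rather than a package
    return Path(list(spec.submodule_search_locations)[0])


if __name__ == "__main__":
    # Assumption: ms-swift is imported under the name `swift`.
    for name in ("peft", "swift"):
        print(f"{name}: {package_dir(name) or 'not installed'}")
```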

**Next, follow these integration steps:**

1.  **Integrate LoRA-MoE Layer into `peft`**:
    - **Target File**: Find the `layer.py` file within the `peft` library at `.../site-packages/peft/tuners/lora/layer.py`.
    - **Source File**: Our `loramoe_layer.py`.
    - **Action**: Open both files. In the target file (`peft/.../layer.py`), locate the `Linear` class. **Replace the entire `Linear` class definition** with the `Linear` class definition from our `loramoe_layer.py`. This modification is specific to the `Linear` layer and is essential for our LoRA-MoE implementation.

2.  **Integrate RO-GRPO Trainer into `ms-swift`**:
    - **Target File**: Find the `grpo_trainer.py` file within the `ms-swift` library at `.../site-packages/swift/trainers/grpo_trainer.py`.
    - **Source File**: Our `ro_grpo_trainer.py`.
    - **Action**: Our `ROGRPOTrainer` class inherits from the original `GRPOTrainer`, so the training process must be pointed at it explicitly. The simplest way is to **append the `ROGRPOTrainer` class from our `ro_grpo_trainer.py` to the end of the target file** (`swift/.../grpo_trainer.py`), then modify the training script's entry point to import and use `ROGRPOTrainer` instead of `GRPOTrainer`.

3.  **Register Custom Reward Functions in `ms-swift`**:
    - **Target File**: Locate the central plugin registration file in `ms-swift`, which is typically `.../site-packages/swift/llm/plugin.py`.
    - **Source File**: Our `routing_reward.py`.
    - **Action**: Add the code from the source file to the end of the target file.

After completing these steps, your environment will be correctly set up to run our experiments.
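
Step 1 is the most error-prone edit because it replaces a class in the middle of a large file. As a rough sketch of how it could be scripted, the example below uses Python's `ast` module to find the line span of a top-level class and swap in the class of the same name from another file, keeping a `.bak` copy of the original. The helper names (`class_span`, `replace_class`, `patch_file`) are hypothetical and not part of this repository. The sketch assumes `Linear` is defined at module top level in both files, and it does not copy any imports that the new class may need.

```python
import ast
import shutil
from pathlib import Path
from typing import Tuple


def class_span(source: str, class_name: str) -> Tuple[int, int]:
    """Return 1-indexed (first, last) lines of a top-level class, decorators included."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            return first, node.end_lineno
    raise ValueError(f"top-level class {class_name!r} not found")


def replace_class(target_src: str, donor_src: str, class_name: str) -> str:
    """Return target_src with its `class_name` definition replaced by the donor's."""
    t_first, t_last = class_span(target_src, class_name)
    d_first, d_last = class_span(donor_src, class_name)
    t_lines = target_src.split("\n")
    d_lines = donor_src.split("\n")
    # Keep everything before and after the old class; splice the donor class in between.
    merged = t_lines[: t_first - 1] + d_lines[d_first - 1 : d_last] + t_lines[t_last:]
    return "\n".join(merged)


def patch_file(target: Path, donor: Path, class_name: str = "Linear") -> Path:
    """Back up `target` next to itself, then rewrite it with the donor's class."""
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)
    patched = replace_class(
        target.read_text(encoding="utf-8"),
        donor.read_text(encoding="utf-8"),
        class_name,
    )
    ast.parse(patched)  # fail before writing if the result is not valid Python
    target.write_text(patched, encoding="utf-8")
    return backup
```

For example, `patch_file(Path(".../site-packages/peft/tuners/lora/layer.py"), Path("loramoe_layer.py"))` would perform step 1 once the elided prefix is filled in with the `Location` reported by `pip show peft`. Copying the `.bak` file back over the target undoes the change.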

## 2. Data and Model Preparation

Before running the experiments, download the required models and datasets and place them in the directories configured in the training and evaluation scripts.

- **Models**: Download the `Qwen2.5-7B-Instruct` and `Qwen2.5-VL-7B-Instruct` models from the Hugging Face Hub.
- **Datasets**:
    - The `NuminaMath-TIR` dataset for unimodal training will be downloaded automatically by `ms-swift`.
    - The `Geometry3k` dataset for multimodal training must be downloaded manually and placed in the directory that the training script expects.
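
As one possible way to fetch the models, the sketch below uses `snapshot_download` from the `huggingface_hub` package. The Hub repository IDs under the `Qwen` organization, the mapping of each model to a task type, and the `./models` target directory are all assumptions; adjust them to match the paths set in the scripts.

```python
from pathlib import Path

# Assumed Hub repository IDs for the two base models named above.
MODEL_REPOS = {
    "unimodal": "Qwen/Qwen2.5-7B-Instruct",
    "multimodal": "Qwen/Qwen2.5-VL-7B-Instruct",
}


def local_model_dir(task_type: str, root: str = "./models") -> Path:
    """Map a task type to a local directory named after the model (hypothetical layout)."""
    return Path(root) / MODEL_REPOS[task_type].split("/")[-1]


def download_model(task_type: str, root: str = "./models") -> Path:
    """Download one model snapshot from the Hugging Face Hub into `root`."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(task_type, root)
    snapshot_download(repo_id=MODEL_REPOS[task_type], local_dir=str(target))
    return target
```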

## 3. Training

The `train.sh` script is the unified entry point for launching all training experiments described in the paper.

### Usage

The script is executed as follows:
```bash
bash train.sh [task_type] [reward_type]
```
- `task_type`: Specifies the task. Options: `unimodal`, `multimodal`.
- `reward_type`: Specifies the training method. Options correspond to the experiments in our paper:
    - `ro_grpo_smooth`: Our proposed method with curriculum-based reward scheduling.
    - `ro_grpo_relative`: Our proposed method with relative improvement gating.
    - `lora_moe_baseline`: The LoRA-MoE baseline trained with standard GRPO.
    - `lora_baseline`: The standard LoRA baseline.

### Example Commands

Here are the commands to reproduce the main experiments from our paper:

**Unimodal Training (on NuminaMath-TIR)**
```bash
# Train with RO-GRPO (Smooth)
bash train.sh unimodal ro_grpo_smooth

# Train the LoRA-MoE baseline
bash train.sh unimodal lora_moe_baseline
```

**Multimodal Training (on Geometry3k)**
```bash
# Train with RO-GRPO (Relative)
bash train.sh multimodal ro_grpo_relative

# Train the LoRA-MoE baseline
bash train.sh multimodal lora_moe_baseline
```

All trained checkpoints and logs will be saved to the `./output/[task_type]/[reward_type]/` directory.
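
To make the argument grid and the output layout concrete, here is a small sketch that validates a `(task_type, reward_type)` pair and returns the directory where its results would be written according to the description above. `run_dir` and `existing_runs` are hypothetical helpers, not part of `train.sh`.

```python
from itertools import product
from pathlib import Path
from typing import List, Tuple

TASK_TYPES = ("unimodal", "multimodal")
REWARD_TYPES = (
    "ro_grpo_smooth",
    "ro_grpo_relative",
    "lora_moe_baseline",
    "lora_baseline",
)


def run_dir(task_type: str, reward_type: str, root: str = "./output") -> Path:
    """Return the output directory for one run, rejecting unknown options."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"task_type must be one of {TASK_TYPES}, got {task_type!r}")
    if reward_type not in REWARD_TYPES:
        raise ValueError(f"reward_type must be one of {REWARD_TYPES}, got {reward_type!r}")
    return Path(root) / task_type / reward_type


def existing_runs(root: str = "./output") -> List[Tuple[str, str]]:
    """List the (task_type, reward_type) pairs whose output directory already exists."""
    return [
        (task, reward)
        for task, reward in product(TASK_TYPES, REWARD_TYPES)
        if run_dir(task, reward, root).is_dir()
    ]
```

Calling `existing_runs()` from the repository root gives a quick view of which experiments have already produced output.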

## 4. Evaluation

The `eval.sh` script is the unified entry point for evaluating the trained checkpoints. It uses **OpenCompass** for unimodal tasks and **VLMEvalKit** for multimodal tasks.

### Before You Start

You must configure the `eval.sh` script before running it:
1.  **Set Checkpoint Paths**: Open `eval.sh` and update the `UNIMODAL_PEFT_PATHS` and `MULTIMODAL_PEFT_PATHS` arrays with the correct paths to the checkpoints you want to evaluate. The script is pre-configured to look in the `./output/` directory.
2.  **Set API Key (for Unimodal Evaluation)**: OpenCompass requires a powerful LLM (e.g., GPT-4) as a judge for unimodal evaluation. Set `OC_JUDGE_API_KEY` and the other related environment variables at the top of the `eval.sh` script.
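
A quick pre-flight check can catch a missing key or a wrong checkpoint path before a long evaluation starts. The sketch below assumes the judge key is exposed as the `OC_JUDGE_API_KEY` environment variable and that each checkpoint path is a directory; `preflight` is a hypothetical helper, not part of `eval.sh`.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping


def preflight(
    task_type: str,
    peft_paths: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    # Only the unimodal pipeline needs the LLM judge key.
    if task_type == "unimodal" and not env.get("OC_JUDGE_API_KEY"):
        problems.append("OC_JUDGE_API_KEY is not set")
    for path in peft_paths:
        if not Path(path).is_dir():
            problems.append(f"checkpoint directory not found: {path}")
    return problems
```

For instance, `preflight("unimodal", ["./output/unimodal/ro_grpo_smooth"])` returns an empty list only when the key is set and the directory exists.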

### Usage

The script is executed as follows:
```bash
bash eval.sh [task_type]
```
- `task_type`: Specifies the evaluation task. Options: `unimodal`, `multimodal`.

### Example Commands

**Unimodal Evaluation**
```bash
bash eval.sh unimodal
```

**Multimodal Evaluation**
```bash
bash eval.sh multimodal
```

Evaluation results will be saved to the directories configured within the `eval.sh` script (by default, under `./outputs/evaluation/`).

