# Grouter

This repository contains the code and scripts for **Grouter**, a learned router for Mixture-of-Experts (MoE) language models. It supports distillation from a teacher model (e.g., Qwen3-30B-A3B), training and fine-tuning of the router, and communication-aware expert grouping for distributed MoE training.

This document provides step-by-step instructions for reviewers to reproduce the environment and run the main pipeline. All paths in the scripts are written for execution inside the provided Docker container; adjust `PROJECT_ROOT` and `DATA_ROOT` when creating the container to match your host layout.

---

## 1. Environment Setup

We use Docker to ensure a consistent environment. Building the image and creating the container (Sections 1.1 and 1.2) happen on the host; all later steps, except cloning Megatron-LM, are intended to be run **inside the container** after it is started.

### 1.1 Build the Docker image

From the repository root:

```bash
export PROJECT_ROOT=/path/to/this/repo   # path to the Grouter repo on your machine
bash env/create_image.sh
```

This builds the image `grouter:latest` from the Dockerfile in `env/`.

### 1.2 Create and run the container

Set `PROJECT_ROOT` and `DATA_ROOT` so that your repo and datasets are mounted correctly. Then start the container:

```bash
export PROJECT_ROOT=/path/to/this/repo
export DATA_ROOT=/path/to/your/data   # parent dir for dsv2_c4_data, qwen3_c4_data, c4_data/en, grouter_predispatch
bash env/create_container.sh
```

Inside the container, the working directory is `/workspace/Megatron-LM-router`. The following directories are mounted:

- `general_router` ← `$PROJECT_ROOT` (this repo)
- `dataset` ← `$DATA_ROOT/dsv2_c4_data`
- `qwen3_dataset` ← `$DATA_ROOT/qwen3_c4_data`
- `c4_dataset` ← `$DATA_ROOT/c4_data/en`
- `grouter_predispatch` ← `$DATA_ROOT/grouter_predispatch`

All commands in Sections 2–6 are run from `/workspace/Megatron-LM-router` (or the equivalent path where `general_router` and `Megatron-LM` are available).
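
Before running `env/create_container.sh`, it can help to confirm on the host that `DATA_ROOT` contains the subdirectories listed in the mount table above. The following is a minimal sketch and not part of the repository: the function name `check_data_root` is hypothetical, and it assumes you want all four subdirectories to exist before the container is created (the output directories may simply be empty at this point).

```bash
# Hypothetical helper (not part of the repo): report which of the DATA_ROOT
# subdirectories from the mount table do not exist yet on the host.
check_data_root() {
    local root="$1" missing=0 sub
    for sub in dsv2_c4_data qwen3_c4_data c4_data/en grouter_predispatch; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $root/$sub"
            missing=1
        fi
    done
    # Returns 0 when every directory exists, 1 otherwise.
    return "$missing"
}
```

A typical call would be `check_data_root "$DATA_ROOT"`; any directory it reports can be created with `mkdir -p` before starting the container.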

---

## 2. Download the C4 dataset and the Qwen3-30B-A3B model

- **C4 (Colossal Clean Crawled Corpus)**  
  Download the English C4 dataset and place it under the path that will be mounted as `/workspace/Megatron-LM-router/c4_dataset` (e.g. `$DATA_ROOT/c4_data/en`). The preprocessing script expects files named like `c4-train.XXXXX-of-01024.json.gz`.

- **Qwen3-30B-A3B**  
  Obtain the Qwen3-30B-A3B model and tokenizer and place them in a directory that will be available inside the container as `model_home/qwen3-30b-a3b`. Ensure the converted Megatron checkpoint (used for distillation) is available at `model_home/qwen3-30b-a3b-converted` as required by `scripts/distillation_qwen3.sh`.
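
To check that the downloaded C4 files follow the naming the preprocessing script expects, a quick count can be useful. This is a sketch under assumptions: the helper name `count_c4_shards` is hypothetical, and `XXXXX` is taken to be a five-digit shard index.

```bash
# Hypothetical helper (not part of the repo): count files in a directory whose
# names match c4-train.XXXXX-of-01024.json.gz, with XXXXX read as five digits.
count_c4_shards() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f \
        -name 'c4-train.[0-9][0-9][0-9][0-9][0-9]-of-01024.json.gz' \
        | wc -l | tr -d ' '
}
```

For example, `count_c4_shards "$DATA_ROOT/c4_data/en"` prints the number of matching training shards, which you can compare against the shard range used in `scripts/preprocess_data.sh`.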

---

## 3. Configure Megatron-LM

The `Megatron-LM/` directory in this repository contains **only the files modified for Grouter** (e.g. MoE router integration, Grouter hooks, and data/training utilities). It is not a full Megatron-LM tree.

To get a runnable Megatron-LM:

1. Clone the official [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) repository.
2. Check out commit `e7c55de9`:
   ```bash
   git checkout e7c55de9
   ```
3. Overwrite (or merge) the corresponding files in the cloned Megatron-LM tree with the contents of this repository’s `Megatron-LM/` directory. That is, replace the same paths as in `Megatron-LM/` (e.g. under `megatron/`, `model_provider.py`, `pretrain_gpt.py`, `tools/`, `utils_grouter/`) with the versions from this repo.

After this, the full Megatron-LM tree (with Grouter modifications) should reside at `Megatron-LM/` relative to the repo root, so that paths like `Megatron-LM/pretrain_gpt.py` and `Megatron-LM/tools/preprocess_data.py` are valid from the container working directory.
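
The overlay in step 3 can be sketched as a plain recursive copy. The helper name `overlay_megatron` and the paths in the commented example are placeholders, not something the repository provides; the sketch assumes that overwriting (rather than merging by hand) is acceptable for every patched file.

```bash
# Hypothetical helper (not part of the repo): copy every file under the patch
# directory onto the same relative path in a full Megatron-LM checkout,
# overwriting files that already exist and leaving all other files untouched.
overlay_megatron() {
    local patch_dir="$1" clone_dir="$2"
    if [ ! -d "$patch_dir" ] || [ ! -d "$clone_dir" ]; then
        echo "both arguments must be existing directories" >&2
        return 1
    fi
    # "src/." copies the contents of src (not src itself) into the destination.
    cp -R "$patch_dir/." "$clone_dir/"
}

# Example with placeholder paths:
#   git clone https://github.com/NVIDIA/Megatron-LM /tmp/Megatron-LM-full
#   git -C /tmp/Megatron-LM-full checkout e7c55de9
#   overlay_megatron Megatron-LM /tmp/Megatron-LM-full
```

After the copy, the merged tree still has to be placed at `Megatron-LM/` as described above, for instance by replacing the patch directory with the merged checkout.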

---

## 4. Data preprocessing

With the container running and both the C4 data and the Qwen3-30B-A3B tokenizer available, preprocess C4 using the Qwen3 tokenizer:

```bash
bash scripts/preprocess_data.sh
```

This uses `Megatron-LM/tools/preprocess_data.py` to tokenize the C4 data and write outputs under `/workspace/Megatron-LM-router/qwen3_dataset/` (e.g. `qwen3-c4-XXXX_text_document.bin` and `.idx`). Adjust the loop range in the script if you use a different subset of C4 shards.
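
Once preprocessing finishes, each `.bin` file should have a matching `.idx` file. The check below is a minimal sketch, assuming the outputs follow the `*_text_document.bin` / `.idx` naming shown above; `check_bin_idx_pairs` is a hypothetical name.

```bash
# Hypothetical helper (not part of the repo): verify that every
# *_text_document.bin in a directory has a non-empty .idx file next to it.
check_bin_idx_pairs() {
    local dir="$1" bin status=0
    for bin in "$dir"/*_text_document.bin; do
        # If the glob matched nothing, $bin is the literal pattern.
        [ -e "$bin" ] || { echo "no .bin files in $dir"; return 1; }
        if [ ! -s "${bin%.bin}.idx" ]; then
            echo "missing or empty index for: $bin"
            status=1
        fi
    done
    return "$status"
}
```

Inside the container this could be called as `check_bin_idx_pairs /workspace/Megatron-LM-router/qwen3_dataset`.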

---

## 5. Distill Grouter from the teacher model

Distill the router (Grouter) from the Qwen3-30B-A3B teacher using:

```bash
bash scripts/distillation_qwen3.sh
```

This script runs distributed distillation with the teacher checkpoint at `model_home/qwen3-30b-a3b-converted`, the tokenizer at `model_home/qwen3-30b-a3b`, and the preprocessed data under `qwen3_dataset`. It produces a distilled Grouter that can be used for MoE routing and for the following training step.

---

## 6. Train an MoE model with Grouter

To train a 650M-parameter MoE model with the distilled Grouter (router scoring and load-balancing):

```bash
bash scripts/650m_grt_ft_score.sh
```

This script trains the MoE model with Grouter-based routing. Checkpoints and logs are written to the path specified in the script (e.g. under `checkpoints/grt_650m_ft_score`). You can change `CHECKPOINT_PATH`, `DATA_PATH`, and other arguments inside the script to match your setup.
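
To see how far a run has progressed, you can read the tracker file that Megatron-LM normally writes into the checkpoint directory. This sketch assumes the checked-out Megatron-LM version uses the usual `latest_checkpointed_iteration.txt` tracker file; the helper name `latest_iteration` is hypothetical.

```bash
# Hypothetical helper (not part of the repo): print the iteration recorded in
# the checkpoint tracker file, assuming Megatron-LM's usual file name.
latest_iteration() {
    local tracker="$1/latest_checkpointed_iteration.txt"
    if [ ! -f "$tracker" ]; then
        echo "no tracker file in $1" >&2
        return 1
    fi
    # Strip whitespace so the value can be compared or reused directly.
    tr -d '[:space:]' < "$tracker"
    echo
}
```

For example, `latest_iteration checkpoints/grt_650m_ft_score` prints the most recently saved iteration if the default checkpoint path from the script is used.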

---

## 7. Advanced: Expert folding and expert tuning

If you need to train or evaluate models with a **different number of experts** than was used during distillation, use **Expert Folding** (mapping) followed by **Expert Tuning**.

### 7.1 Obtain expert mapping (Expert Folding)

Run the expert mapping tool to get a mapping from the current expert configuration to the target number of experts (e.g. 64):

```bash
bash grouter_ep_optimizer/scripts/run_construct_mapping.sh
```

Important arguments in the script (and in `grouter_ep_optimizer/tools/construct_mapping.py`) include:

- `--target-num-experts`: target number of experts (e.g. 64)
- `--output-mapping`: path to the output mapping file (e.g. `grouter_ep_optimizer/grouter/qwen3_30b/cvt64_map_affinity.json`)
- `--grouter-config-path`, `--grouter-checkpoint-path`: Grouter config and checkpoint

Adjust `C4_HOME`, `DATA_BLEND`, and the other paths in the script to match your container layout and data.
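
After the mapping script finishes, a quick sanity check on the file given by `--output-mapping` can catch a failed or truncated run before it is passed to the next step. The sketch below only confirms that the file exists and parses as JSON; it makes no assumption about the mapping's internal structure. `check_mapping_json` is a hypothetical name, and `python3` is assumed to be on the `PATH`.

```bash
# Hypothetical helper (not part of the repo): confirm that a mapping file
# exists, is non-empty, and is syntactically valid JSON.
check_mapping_json() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file"
        return 1
    fi
    # json.tool exits non-zero when the input is not valid JSON.
    if ! python3 -m json.tool "$file" > /dev/null 2>&1; then
        echo "not valid JSON: $file"
        return 1
    fi
    echo "ok: $file"
}
```

Pass it the same path you set for `--output-mapping`, for example `check_mapping_json grouter_ep_optimizer/grouter/qwen3_30b/cvt64_map_affinity.json`.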

### 7.2 Expert tuning

Using the mapping file produced above, fine-tune the Grouter for the new expert configuration:

```bash
bash grouter_ep_optimizer/scripts/run_finetune_grouter.sh
```

Set `--grouter-config-path` to the **mapping config** produced in Section 7.1 (e.g. the generated `cvt32_mapping_affinity.json`, or the corresponding file for 64 experts) and `--grouter-checkpoint-path` to the Grouter checkpoint. The script uses `grouter_ep_optimizer/tools/finetune_grouter.py`; tune `--max-steps`, `--learning-rate`, and the data paths as needed.

---

## 8. Advanced: Communication optimization (predispatch and expert grouping)

To use Grouter for **communication-aware expert placement** (e.g. reducing all-to-all cost via expert grouping), run predispatch and then cluster samples by expert usage.

### 8.1 Predispatch

Run predispatch to compute router scores (expert preferences) for your data:

```bash
bash grouter_ep_optimizer/scripts/run_predispatch.sh
```

This uses `grouter_ep_optimizer/tools/predispatch.py` and writes score files under the path given by `--output_prefix` (e.g. under `/workspace/Megatron-LM-router/grouter_predispatch/`). Point `--data_path` to your C4 (or other) JSONL shards and set `--grouter_ckpt` and `--grouter_config` to your trained Grouter.

### 8.2 Expert grouping

Using the predispatch outputs, run expert grouping to form communication-friendly expert groups:

```bash
bash grouter_ep_optimizer/scripts/run_cluster_samples.sh
```

This runs `grouter_ep_optimizer/run_cluster_samples.py` and uses the predispatch directory specified by `--predispatch-path`. Set `--num-nodes`, `--num-experts`, `--output-dir`, and other parameters to match your cluster and MoE configuration. The resulting grouping can be used for placement and communication optimization in your training or inference pipeline.

---

## 9. Summary of scripts

| Step | Script | Purpose |
|------|--------|---------|
| Environment | `env/create_image.sh` | Build Docker image |
| Environment | `env/create_container.sh` | Run container with mounted repo and data |
| Data | (manual) | Download C4 and Qwen3-30B-A3B |
| Megatron | (manual) | Clone Megatron-LM, checkout `e7c55de9`, overlay this repo’s `Megatron-LM/` |
| Preprocessing | `scripts/preprocess_data.sh` | Tokenize C4 with Qwen3 tokenizer |
| Distillation | `scripts/distillation_qwen3.sh` | Distill Grouter from Qwen3-30B-A3B |
| Training | `scripts/650m_grt_ft_score.sh` | Train 650M MoE with Grouter |
| Advanced | `grouter_ep_optimizer/scripts/run_construct_mapping.sh` | Expert mapping for different expert counts |
| Advanced | `grouter_ep_optimizer/scripts/run_finetune_grouter.sh` | Expert tuning with mapping |
| Advanced | `grouter_ep_optimizer/scripts/run_predispatch.sh` | Predispatch for communication optimization |
| Advanced | `grouter_ep_optimizer/scripts/run_cluster_samples.sh` | Expert grouping from predispatch |

---

## 10. Directory layout (overview)

- **`env/`** – Dockerfile and scripts to build and run the container.
- **`scripts/`** – Top-level pipeline: preprocessing, distillation, and 650M Grouter training.
- **`Megatron-LM/`** – Megatron-LM patches (must be overlaid on a full Megatron-LM checkout at commit `e7c55de9`).
- **`grouter_ep_optimizer/`** – Grouter and expert-placement utilities:
  - **`scripts/`** – Expert mapping, fine-tuning, predispatch, and clustering.
  - **`tools/`** – Entry points: `construct_mapping.py`, `finetune_grouter.py`, `predispatch.py`, etc.
  - **`grouter/`** – Grouter model and configs.

For questions or issues regarding this supplementary material, please refer to the main submission or the contact information provided there.
