# One Billion Word Benchmark (LM1B) — Pretraining and MoE Fine-Tuning

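This section walks through the end-to-end pipeline for training and analyzing Mixture-of-Experts (MoE) models on the [One Billion Word Benchmark](https://www.statmt.org/lm-benchmark/) (LM1B): pretraining a GPT-2 baseline, fine-tuning it with an MoE layer, matching experts across independently fine-tuned runs, and plotting linear mode connectivity. Training is JAX-based.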

---

## 1. Download and Preprocess LM1B Dataset

To prepare the LM1B dataset, run:

```bash
bash src/lm1b/data.sh
```

The script downloads LM1B and converts it into the format the training scripts expect. The later steps read the data through `--data-path ./data/lm1b`.

---

## 2. Train a Baseline GPT-2 Model

To train a standard GPT-2 model on LM1B:

```bash
CUDA_VISIBLE_DEVICES=0,1 python src/lm1b/train_model.py \
    --random-seed 0 \
    --model-config-name gpt2 \
    --batch-size 48 \
    --max_step 500000 \
    --model-save-dir /root/weights/lm1b \
    --data-path ./data/lm1b
```

---

## 3. Fine-tune with Mixture-of-Experts (MoE)

To fine-tune the baseline with an MoE layer inserted at the position(s) given by `--moe-layer-indices`, pass the checkpoint from step 2 as `--model-path`:

```bash
CUDA_VISIBLE_DEVICES=0,1 python src/lm1b/finetune_moe.py \
    --model-config-name gpt2 \
    --batch-size 48 \
    --model-path /root/weights/lm1b/TrainModel-lr0.00025-step500000-size48/best_500000 \
    --data-path ./data/lm1b \
    --random-seed 0 \
    --moe-layer-indices 0 \
    --num-shared-experts 0 \
    --num-routed-experts 4 \
    --topk 4 \
    --model-save-dir /root/weights/lm1b/finetune48
```
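
For readers unfamiliar with the routing flags, here is a minimal sketch of top-k expert routing for a single token, in plain Python. It assumes a softmax router whose top-k probabilities are renormalized to sum to 1; the repository's actual router may differ. `route_token` and its arguments are hypothetical names used for illustration, not part of `finetune_moe.py`.

```python
import math

def route_token(router_logits, topk):
    """Return (expert_index, weight) pairs for one token.

    Assumption: softmax over all routed experts, keep the top-k,
    then renormalize the kept weights so they sum to 1.
    """
    # Numerically stable softmax over the routed experts.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k largest probabilities (ties broken by lower index).
    chosen = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:topk]

    # Renormalize over the selected experts only.
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

# Four routed experts, as in --num-routed-experts 4.
logits = [2.0, 0.5, 1.0, -1.0]
sparse = route_token(logits, topk=2)  # only the two highest-scoring experts are used
dense = route_token(logits, topk=4)   # every expert receives some weight
```

Under this scheme, setting `--topk` equal to `--num-routed-experts` (4 and 4 in the command above) activates every routed expert for every token, so the router only reweights experts rather than selecting among them.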

---

## 4. Expert Matching Between MoE Models

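To align expert indices across two independently fine-tuned models, pass both checkpoints to the matching script. The directory names in the example encode each run's hyperparameters (here top-k 2, 16 routed experts, seeds 0 and 20), which differ from the values in the step 3 command; substitute the paths of your own runs: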

```bash
CUDA_VISIBLE_DEVICES=0 python src/lm1b/expert_matching.py \
    --model-a /root/weights/lm1b/finetune48/lr0.00025-topk2-shared0-routed16-batch48-seed0/best_72000 \
    --model-b /root/weights/lm1b/finetune48/lr0.00025-topk2-shared0-routed16-batch48-seed20/best_60000 \
    --data-path ./data/lm1b
```
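
Matching is needed because an MoE layer computes the same function when its experts are reordered (as long as the router outputs are reordered with them), so two runs can learn similar experts at different indices. The sketch below illustrates the idea on toy data: it searches for the permutation of model B's experts that minimizes the total squared distance to model A's experts. This is an assumption for illustration only. The source does not state which criterion `expert_matching.py` uses, and since the script takes `--data-path` it may well match on data-dependent statistics rather than raw weights. `match_experts` and `sq_dist` are hypothetical names.

```python
from itertools import permutations

def sq_dist(u, v):
    """Squared Euclidean distance between two flat weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def match_experts(experts_a, experts_b):
    """Return (perm, cost) where experts_b[perm[i]] is matched to experts_a[i].

    Brute force over all permutations: fine for a handful of experts,
    infeasible for 16, where an assignment solver would be needed.
    """
    n = len(experts_a)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(experts_a[i], experts_b[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost

# Toy example: model B holds the same experts as model A,
# shuffled and slightly perturbed.
experts_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
experts_b = [[0.1, 1.0], [-0.9, 0.0], [1.0, 0.1]]

perm, cost = match_experts(experts_a, experts_b)

# Reorder model B's experts into model A's index order.
aligned_b = [experts_b[j] for j in perm]
```

In a real model the same permutation would also have to be applied to the router's output dimension, otherwise the reordered model no longer computes the same function.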

---

## 5. Visualize Linear Mode Connectivity (LMC)

To generate LMC plots between matched models:

```bash
python src/lm1b/plot.py \
    --file-1 "results/lm1b/[lr0.00025-topk2-shared0-routed16-batch48-seed0+lr0.00025-topk2-shared0-routed16-batch48-seed20].json" \
    --file-2 "results/lm1b/[lr0.00025-topk2-shared0-routed16-batch48-seed0+lr0.00025-topk2-shared0-routed16-batch48-seed40].json" \
    --file-3 "results/lm1b/[lr0.00025-topk2-shared0-routed16-batch48-seed20+lr0.00025-topk2-shared0-routed16-batch48-seed40].json" \
    --output-dir "plots/lm1b/[lr0.00025-topk2-shared0-routed16-batch48]"
```

The resulting plots show how well independently trained MoE models are connected in parameter space, that is, whether the loss stays low along the straight line between their matched weights.
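
As background, linear mode connectivity is commonly measured by evaluating the loss along the path θ(α) = (1 − α)·θ_A + α·θ_B for α in [0, 1] and reporting the barrier, the largest amount by which the loss on the path exceeds the linear interpolation of the two endpoint losses. The sketch below shows that computation on a toy one-parameter loss. It is not taken from `plot.py`, and the function names and the toy loss are hypothetical.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two flat parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Loss along the linear path and the barrier height.

    Barrier = max over alpha of
        loss(theta(alpha)) - [(1 - alpha) * loss(theta_a) + alpha * loss(theta_b)]
    """
    alphas = [i / (num_points - 1) for i in range(num_points)]
    losses = [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    baseline = [(1.0 - a) * losses[0] + a * losses[-1] for a in alphas]
    barrier = max(l - b for l, b in zip(losses, baseline))
    return alphas, losses, barrier

# Toy loss with two minima at x = -1 and x = +1
# (a stand-in for a real evaluation loss on held-out data).
def toy_loss(theta):
    x = theta[0]
    return (x * x - 1.0) ** 2

alphas, losses, barrier = loss_barrier(toy_loss, [-1.0], [1.0])
```

A barrier close to zero means the two solutions are linearly connected. In the toy case the two minima are separated by a bump of height 1 at α = 0.5, so they are not.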

---
