# WikiText-103 Preprocessing and MoE Fine-Tuning

This section describes the end-to-end pipeline for preparing and training language models on the [WikiText-103](https://huggingface.co/datasets/wikitext) dataset using the [jax-lm-training](https://github.com/codertimo/jax-lm-training) framework. The process includes preprocessing, base model training, Mixture-of-Experts (MoE) fine-tuning, expert matching, and visualization of linear mode connectivity (LMC).

---

## 1. Preprocess the WikiText-103 Dataset

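Later steps also read `wikitext.validation-00000-of-00001` and `wikitext.test-00000-of-00001`, so the command presumably needs to be repeated with `--dataset-split-type` and `--output-path` adjusted for the validation and test splits. To preprocess the `train` split using the GPT-2 tokenizer: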

```bash
python src/wikitext103/preprocess.py \
    --tokenizer-model "gpt2" \
    --min-sequence-length 128 \
    --max-sequence-length 256 \
    --num-special-token-reserved 2 \
    --ignore-label -100 \
    --stride 128 \
    --dataset-name "wikitext" \
    --dataset-sub-name "wikitext-103-v1" \
    --dataset-split-type "train" \
    --output-path "data/wikitext103/wikitext.train" \
    --direct_running_mode "multi_threading" \
    --direct_num_workers 1
```
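
The `--max-sequence-length`, `--min-sequence-length`, `--stride`, and `--ignore-label` flags suggest a sliding-window chunking scheme. The sketch below is one plausible reading of those flags, not the actual logic of `preprocess.py`: the function names are hypothetical, and it assumes the reserved special-token slots shrink each window and that windows shorter than the minimum length are dropped.

```python
def sliding_windows(token_ids, max_len=256, min_len=128, stride=128, reserved=2):
    """Split one tokenized document into overlapping windows.

    Assumption: `reserved` slots are held back for special tokens,
    so each window carries at most max_len - reserved tokens.
    """
    body_len = max_len - reserved
    windows = []
    for start in range(0, len(token_ids), stride):
        chunk = token_ids[start:start + body_len]
        if len(chunk) < min_len:
            break  # the remaining tail is too short to keep
        windows.append(chunk)
        if start + body_len >= len(token_ids):
            break  # this window already reached the end of the document
    return windows


def pad_labels(chunk, max_len=256, ignore_label=-100):
    """Pad labels to max_len so padded positions can be masked out of the loss.

    Special tokens are left out here for brevity.
    """
    return chunk + [ignore_label] * (max_len - len(chunk))


windows = sliding_windows(list(range(600)))
print([len(w) for w in windows])
```

With a stride of 128 and windows of roughly 256 tokens, consecutive windows overlap by about half, so most tokens appear in two training sequences.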

---

## 2. Train a Baseline GPT-2 Model

To train a standard GPT-2 model from scratch or from pretrained weights:

```bash
CUDA_VISIBLE_DEVICES=0 python src/wikitext103/train_model.py \
    --model-config-name "gpt2" \
    --train-dataset-paths data/wikitext103/wikitext.train-00000-of-00001 \
    --eval-dataset-paths data/wikitext103/wikitext.validation-00000-of-00001 \
    --batch-size 16 \
    --random-seed 0 \
    --max-sequence-length 256 \
    --num-epochs 10 \
    --learning-rate 3e-5 \
    --dtype float32 \
    --logging-frequency 100 \
    --eval-frequency 5000 \
    --save-frequency 5000 \
    --model-save-dir /root/weights/wikitext103/pretrained
```

---

## 3. Impact of Feedforward Reinitialization on Pretrained Transformer Performance

To evaluate the effect of reinitializing the Feedforward Network (FFN) in each Transformer layer, run the following script:

```bash
CUDA_VISIBLE_DEVICES=0 python src/wikitext103/layer_replace_init.py \
    --model-path /root/weights/wikitext103/pretrained/checkpoint_322089 \
    --train-dataset-paths ./data/wikitext103/wikitext.validation-00000-of-00001 \
    --eval-dataset-paths ./data/wikitext103/wikitext.test-00000-of-00001
```
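
As a rough illustration of what "reinitializing the FFN" in one layer can mean, the sketch below redraws a single layer's FFN weights while leaving every other parameter untouched. The parameter layout (`layer_{i}` / `ffn` keys), the function name, and the `N(0, 0.02^2)` initializer are assumptions made for this example; they are not taken from `layer_replace_init.py`, and a real initializer would typically treat biases separately.

```python
import copy
import random


def reinit_ffn(params, layer, seed=0, std=0.02):
    """Return a copy of `params` with one layer's FFN weights redrawn.

    `params` is a hypothetical nested dict:
    {"layer_0": {"attn": {...}, "ffn": {name: [floats]}}, ...}
    """
    rng = random.Random(seed)
    new_params = copy.deepcopy(params)  # keep the pretrained weights intact
    ffn = new_params[f"layer_{layer}"]["ffn"]
    for name, values in ffn.items():
        # Simplification: every FFN tensor is redrawn from N(0, std^2).
        ffn[name] = [rng.gauss(0.0, std) for _ in values]
    return new_params


pretrained = {
    "layer_0": {"attn": {"w": [1.0, 2.0]}, "ffn": {"w_in": [1.0, 1.0, 1.0], "w_out": [2.0, 2.0]}},
    "layer_1": {"attn": {"w": [3.0, 4.0]}, "ffn": {"w_in": [5.0, 5.0, 5.0], "w_out": [6.0, 6.0]}},
}
# Reinitialize each layer in turn; evaluation of each variant would go here.
variants = {i: reinit_ffn(pretrained, i) for i in range(2)}
```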

---

## 4. Fine-tune with Mixture-of-Experts (MoE)

To fine-tune the pretrained model with an MoE-enhanced architecture:

```bash
CUDA_VISIBLE_DEVICES=0 python src/wikitext103/finetune_moe.py \
    --model-path /root/weights/wikitext103/pretrained/checkpoint_322089 \
    --moe-layer-indices 0 \
    --num-shared-experts 1 \
    --num-routed-experts 4 \
    --topk 2 \
    --seed 0 \
    --train-dataset-paths data/wikitext103/wikitext.train-00000-of-00001 \
    --eval-dataset-paths data/wikitext103/wikitext.validation-00000-of-00001 \
    --batch-size 16 \
    --max-sequence-length 256 \
    --num-epochs 10 \
    --learning-rate 3e-5 \
    --dtype float32 \
    --logging-frequency 100 \
    --eval-frequency 5000 \
    --save-frequency 5000 \
    --model-save-dir /root/weights/wikitext103/finetune/
```
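
To make the `--num-shared-experts`, `--num-routed-experts`, and `--topk` flags concrete, here is a minimal single-token sketch of a layer that combines always-on shared experts with top-k routed experts. It is an assumption about the general technique, not the code in `finetune_moe.py`: the function names are hypothetical, and renormalizing the gate weights over the selected experts is only one of several common choices.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, routed_experts, shared_experts, topk=2):
    """Combine shared experts with the top-k routed experts for one token.

    x: list of floats; each expert is a callable mapping a list to a list.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:topk]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the top-k
    out = [0.0] * len(x)
    for expert in shared_experts:  # shared experts see every token
        out = [o + y for o, y in zip(out, expert(x))]
    for i in top:  # routed experts are weighted by their gate value
        weight = probs[i] / norm
        out = [o + weight * y for o, y in zip(out, routed_experts[i](x))]
    return out, top


# Toy experts: expert k multiplies its input by k + 1.
routed = [lambda v, k=k: [(k + 1) * t for t in v] for k in range(4)]
shared = [lambda v: list(v)]
output, chosen = moe_forward([1.0], [0.0, 1.0, 3.0, 2.0], routed, shared, topk=2)
print(chosen)
```

Here the router picks the two experts with the largest logits (indices 2 and 3), and the shared expert contributes regardless of the routing decision.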

---

## 5. Expert Matching Between Models

To align expert indices across two independently fine-tuned MoE models:

```bash
CUDA_VISIBLE_DEVICES=0 python src/wikitext103/expert_matching.py \
    --model-a /root/weights/wikitext103/finetune/lr3e-05-topk2-shared0-routed2-seed0/checkpoint_322089 \
    --model-b /root/weights/wikitext103/finetune/lr3e-05-topk2-shared0-routed2-seed20/checkpoint_322089 \
    --train-dataset-paths ./data/wikitext103/wikitext.validation-00000-of-00001 \
    --eval-dataset-paths ./data/wikitext103/wikitext.test-00000-of-00001
```
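
Expert indices are arbitrary: two independently trained MoE models can learn similar experts under different index orders. The sketch below illustrates the idea of matching by searching for the permutation that minimizes the total squared distance between flattened expert weights. The cost function, the brute-force search, and all names are assumptions for illustration; `expert_matching.py` may use a different criterion (for example, one based on activations over the supplied dataset).

```python
from itertools import permutations


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_experts(experts_a, experts_b):
    """Find perm so that experts_b[perm[i]] is paired with experts_a[i].

    Brute force over all permutations; fine for a handful of experts.
    """
    n = len(experts_a)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(experts_a[i], experts_b[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


def apply_permutation(rows, perm):
    """Reorder per-expert rows (expert weights or router output rows) by perm."""
    return [rows[j] for j in perm]


experts_a = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
experts_b = [[5.1, 5.0], [1.0, 0.1], [0.0, 0.9]]
perm, cost = match_experts(experts_a, experts_b)
aligned_b = apply_permutation(experts_b, perm)
print(perm)
```

The same permutation has to be applied to the router's per-expert outputs, otherwise the reordered model computes a different function. For larger expert counts, an assignment solver such as `scipy.optimize.linear_sum_assignment` avoids the factorial search.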

---

## 6. Visualize Linear Mode Connectivity (LMC)

To generate LMC plots from the interpolation between expert-aligned models:

```bash
python src/wikitext103/plot.py \
    --file-1 "results/wikitext103/[lr3e-05-topk2-shared0-routed2-seed0+lr3e-05-topk2-shared0-routed2-seed20].json" \
    --file-2 "results/wikitext103/[lr3e-05-topk2-shared0-routed2-seed0+lr3e-05-topk2-shared0-routed2-seed40].json" \
    --file-3 "results/wikitext103/[lr3e-05-topk2-shared0-routed2-seed20+lr3e-05-topk2-shared0-routed2-seed40].json" \
    --output-dir "plots/wikitext103/[lr3e-05-topk2-shared0-routed2]"
```
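
For context, an LMC plot shows the loss along the straight line between two models' parameters. The sketch below shows the interpolation and a common summary statistic, the loss barrier. It is a generic illustration with hypothetical names and does not reflect the JSON schema or the internals of `plot.py`.

```python
def interpolate(params_a, params_b, alpha):
    """Return (1 - alpha) * A + alpha * B for two flat parameter dicts."""
    return {
        key: [(1 - alpha) * a + alpha * b for a, b in zip(params_a[key], params_b[key])]
        for key in params_a
    }


def loss_barrier(losses):
    """Largest gap between the path loss and the straight line joining the endpoints.

    `losses` holds at least two values, measured at evenly spaced alphas from 0 to 1.
    """
    n = len(losses) - 1
    return max(
        loss - ((1 - i / n) * losses[0] + (i / n) * losses[-1])
        for i, loss in enumerate(losses)
    )


midpoint = interpolate({"w": [0.0, 2.0]}, {"w": [4.0, 2.0]}, 0.25)
barrier = loss_barrier([1.0, 3.0, 2.0])
print(midpoint, barrier)
```

A barrier close to zero indicates that the two models are linearly mode connected; a large barrier after expert matching suggests the alignment did not place them in the same loss basin.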

---

