# Merge-of-Thought Distillation (MoT) — README

> Lightweight multi-teacher **Merge-of-Thought (MoT)** distillation for long chain-of-thought (CoT) reasoning.
> This repo includes (1) data distillation from teacher models and (2) MoT training.

---

## 🗂️ Repository layout (minimal)

```
.
├─ data_distillation/
│  ├─ distill_16a.py           # Distill CoT data from teachers
│  └─ sever.sh                 # Starts teacher endpoints / inference servers
├─ run.sh                      # Main entry for MoT training
├─ README.md                   # This file
└─ requirements.txt            # (recommended) Python deps
```



---

## ⚙️ Environment

* Python ≥ 3.10
* CUDA toolkit compatible with your GPU drivers (e.g., CUDA 12.x for H800)
* Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
---

## 📥 Datasets

We use two public datasets:

* **BOBA-200**: [https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data](https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data)
* **S1K-200**: [https://huggingface.co/datasets/simplescaling/s1K](https://huggingface.co/datasets/simplescaling/s1K)

You can download both with the `datasets` API or `huggingface-cli`.
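
As one option, here is a minimal sketch that fetches the raw files of both dataset repos with `huggingface_hub.snapshot_download`. The helper name `download_all` and the `raw/` subfolder are assumptions for illustration. This README does not specify how the 200-example subsets (`boba-200.jsonl`, `s1k-200.jsonl`) are built, so the sketch stops at the raw download. Calling `download_all()` needs network access and the `huggingface_hub` package.

```python
from pathlib import Path

# Dataset repos named in this README, keyed by the local folder names
# used in the example layout.
DATASET_REPOS = {
    "boba": "inclusionAI/AReaL-boba-Data",
    "s1k": "simplescaling/s1K",
}


def download_all(root: str = "data") -> dict[str, Path]:
    """Download each dataset repo into <root>/<name>/raw and return the paths."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = {}
    for name, repo_id in DATASET_REPOS.items():
        target = Path(root) / name / "raw"  # "raw" subfolder is an assumption
        snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(target))
        paths[name] = target
    return paths
```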

**Expected layout (example):**

```
data/
├─ boba/
│  └─ boba-200.jsonl
└─ s1k/
   └─ s1k-200.jsonl
```

---

## 🧪 Step 1 — Data Distillation (teacher → CoT)

This step queries teacher model(s) to produce CoT rationales, then stores filtered/distilled samples for training.

1. **Start teacher inference**:

```bash
cd data_distillation
bash sever.sh
```

2. **Run distillation script**:

```bash
python distill_16a.py 
```
**Output:** one distilled file per teacher, e.g.

```
data/boba/distilled/
  ├─ qwq.jsonl
  ├─ qwen32b.jsonl
  ├─ qwen235b.jsonl
  └─ r1.jsonl
```
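
To inspect the distilled output, a small loader can be handy. The sketch below assumes each file holds one JSON object per line. The record fields are not specified in this README, so the hypothetical helper `load_distilled` makes no assumptions about them and keys each corpus by file stem (e.g., `qwq`, `r1`).

```python
import json
from pathlib import Path


def load_distilled(distilled_dir: str) -> dict[str, list[dict]]:
    """Return {teacher_name: [records]} for every *.jsonl file in distilled_dir."""
    corpora = {}
    for path in sorted(Path(distilled_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            # One JSON object per non-empty line; blank lines are skipped.
            corpora[path.stem] = [json.loads(line) for line in f if line.strip()]
    return corpora
```

For example, `load_distilled("data/boba/distilled")` would return a dict with one entry per teacher file found in that folder.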

---

## 🧩 Step 2 — MoT Training (multi-teacher merge-of-thought)

MoT alternates between **teacher-specific branch SFT** and a **weight-space merge** (simple parameter averaging), repeated over multiple rounds.

### Quick start

From repo root:

```bash
bash ./run.sh
```

**What `run.sh` typically does:**

* Reads base model (e.g., `QWEN3-8B`, `QWEN3-14B`, `QWEN3-30B-A3B`)
* Loads multiple distilled corpora (e.g., QWQ / Qwen3-32B / Qwen3-235B / R1)
* Iterates:

  1. Branch SFT for each teacher on that teacher’s CoT (`Eq. (1)` in paper)
  2. Weight merge (avg) to form the new init (`Eq. (2)` in paper)
  3. Repeat for T rounds
* Saves checkpoints every **50 steps** (best-checkpoint selection is done in evaluation)
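
The branch-then-merge loop described above can be sketched in a few lines. This is a toy illustration, not the contents of `run.sh`: weights are plain Python lists rather than model tensors, `toy_sft` is a made-up stand-in for real SFT, and the names `average_weights` and `mot_train` are hypothetical. It assumes every branch starts from the same merged weights and that the merge is a uniform average.

```python
from typing import Callable

Weights = dict[str, list[float]]


def average_weights(branches: list[Weights]) -> Weights:
    """Uniform element-wise average of parameter dicts with identical keys and shapes."""
    n = len(branches)
    return {
        name: [sum(vals) / n for vals in zip(*(b[name] for b in branches))]
        for name in branches[0]
    }


def mot_train(
    init: Weights,
    teacher_corpora: dict[str, list],
    branch_sft: Callable[[Weights, list], Weights],
    rounds: int,
) -> Weights:
    """Alternate per-teacher branch SFT and uniform weight averaging for `rounds` rounds."""
    merged = init
    for _ in range(rounds):
        # Every branch starts from the same merged weights.
        branches = [branch_sft(merged, corpus) for corpus in teacher_corpora.values()]
        merged = average_weights(branches)
    return merged


def toy_sft(weights: Weights, corpus: list) -> Weights:
    """Toy stand-in for SFT: move each weight halfway toward the corpus mean."""
    target = sum(corpus) / len(corpus)
    return {k: [w + 0.5 * (target - w) for w in v] for k, v in weights.items()}


result = mot_train({"w": [0.0, 4.0]}, {"qwq": [2.0], "r1": [6.0]}, toy_sft, rounds=1)
print(result)  # {'w': [2.0, 4.0]}
```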

---

## 📊 Evaluation (AIME & others)

* We evaluate every **50 steps** and keep the **best checkpoint**, selected by AIME AVG.
* AIME24/25 results are reported as **16-run averages**.
* For generalization/forgetting analysis (e.g., CEVAL, MMLU variants, GPQA, LiveCodeBench, PhyBench), run your existing evaluation suite against the saved checkpoints.
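
Best-checkpoint selection can be sketched as follows. The sketch assumes AIME AVG is the mean of the per-benchmark run averages (AIME24 and AIME25), which this README does not spell out. The helper names are hypothetical, and the scores are made-up toy values with 2 runs per benchmark instead of 16, not reported results.

```python
from statistics import mean


def aime_avg(runs_by_benchmark: dict[str, list[float]]) -> float:
    """Average each benchmark over its runs, then average across benchmarks."""
    return mean(mean(runs) for runs in runs_by_benchmark.values())


def best_checkpoint(scores: dict[int, dict[str, list[float]]]) -> tuple[int, float]:
    """Return (step, AIME AVG) of the top checkpoint; the earliest step wins ties."""
    step = max(sorted(scores), key=lambda s: aime_avg(scores[s]))
    return step, aime_avg(scores[step])


# Toy accuracies keyed by training step (illustrative values only).
toy_scores = {
    50: {"aime24": [0.5, 0.25], "aime25": [0.25, 0.5]},
    100: {"aime24": [0.5, 0.75], "aime25": [0.5, 0.25]},
}
print(best_checkpoint(toy_scores))  # (100, 0.5)
```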

---

## ✅ TL;DR — Commands

```bash
# 0) Install deps
pip install -r requirements.txt

# 1) Download datasets into data/ (see the Datasets section above)

# 2) Start teacher endpoints
cd data_distillation
bash sever.sh

# 3) Distill (example: BOBA, teacher qwen32b)
python distill_16a.py 

# 4) Train MoT
cd ..
bash ./run.sh
```

---
