# T3S Token Masking + Training Pipeline

This repo provides a minimal workflow to:
1) compute per-token confidence on two checkpoints,
2) derive a token mask and write masked training data, and
3) train on the masked dataset with **LLaMA-Factory**.

---

## 1) Environment & Prerequisites

- Use the same Python environment for `sglang` and the masking scripts.
- Before running, check paths inside these scripts:
  - `before_sft.py` (base/earlier checkpoint confidence)
  - `after_sft.py`  (later/distilled checkpoint confidence)
  - `mask_data.py`  (generate masked JSONL)

Typical things to verify:
- model/checkpoint paths
- input dataset JSONL path
- output directory for confidence files and masked data
- sglang endpoint host/port
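
The path checks above can be scripted before anything is launched. The following is a minimal sketch, not part of this repo: `missing_paths` and the `CONFIG` values are hypothetical placeholders, to be replaced with the paths your copies of the scripts actually use.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the labels of entries whose path does not exist on disk."""
    return [label for label, p in paths.items() if not Path(p).exists()]


# Hypothetical placeholder values: substitute the paths configured in
# before_sft.py, after_sft.py, and mask_data.py.
CONFIG = {
    "model checkpoint": "/path/to/checkpoint",
    "input dataset": "/path/to/train.jsonl",
    "output directory": "/path/to/output",
}

if __name__ == "__main__":
    for label in missing_paths(CONFIG):
        print(f"missing: {label} -> {CONFIG[label]}")
```

An empty result means every listed path exists; it does not confirm that the scripts point at those same paths.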

---

## 2) Token Masking Pipeline

### Step 1: Start the sglang service

Start the sglang server using whichever launcher your checkout includes:

- If you have a notebook: run `server.ipynb` (or your notebook file) to launch the service.
- If you have a script: run `server.py` (or your script file) to launch the service.

> Make sure the printed host/port matches what `before_sft.py` / `after_sft.py` are using.
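
One way to confirm the host/port match is to try opening a TCP connection to the endpoint before running the confidence scripts. This is a hedged sketch using only the standard library; `endpoint_reachable` is a hypothetical helper, and the host and port in the usage comment are placeholders, not values taken from this repo.

```python
import socket


def endpoint_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


# Example call with placeholder values; use the host/port your server printed:
# endpoint_reachable("127.0.0.1", 30000)
```

A successful connection only shows that something is listening on that port; it does not prove the model has finished loading.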

### Step 2: Compute confidence for two checkpoints

Run:

```bash
python before_sft.py
python after_sft.py
```

* `before_sft.py`: computes per-token confidence using the **base / earlier** checkpoint.
* `after_sft.py`: computes per-token confidence using the **later / distilled** checkpoint.

The masking step compares these two outputs token by token (e.g., the change in probability or confidence between checkpoints) to decide which tokens to mask.
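
To make the idea of a token-wise change concrete, here is an illustrative sketch. It assumes each checkpoint yields one confidence value per token for the same token sequence. The function names, the threshold, and the rule "select tokens whose confidence rose by more than the threshold" are all assumptions made for illustration; the actual criterion is whatever `mask_data.py` implements.

```python
def confidence_delta(before, after):
    """Per-token change in confidence: after minus before."""
    if len(before) != len(after):
        raise ValueError("both checkpoints must score the same token sequence")
    return [a - b for b, a in zip(before, after)]


def tokens_to_mask(delta, threshold):
    """Indices whose confidence change exceeds the (assumed) threshold."""
    return [i for i, d in enumerate(delta) if d > threshold]


# Hypothetical per-token confidences for a three-token response.
before = [0.5, 0.25, 0.75]
after = [0.75, 0.25, 0.25]
delta = confidence_delta(before, after)
selected = tokens_to_mask(delta, threshold=0.125)
```

The sign convention and the direction of the comparison are design choices; flipping either changes which tokens are selected.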

### Step 3: Generate masked dataset

After both confidence files are produced, run:

```bash
python mask_data.py
```

This should produce a masked training JSONL file in which each record contains at least:

* `messages`
* `mask_input_id`
* `mask_label_id`
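
A quick way to confirm the three fields listed above are present is to scan the output file before training. This is a sketch under the assumption that the file holds one JSON object per line; `find_bad_records` is a hypothetical helper, not a script shipped with this repo.

```python
import json

REQUIRED_KEYS = ("messages", "mask_input_id", "mask_label_id")


def find_bad_records(path):
    """Return (line_number, missing_keys) for records lacking required fields."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_KEYS if k not in record]
            if missing:
                problems.append((lineno, missing))
    return problems
```

An empty list means every record carries all three keys; the check does not validate the contents or lengths of the mask fields.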

---

## 3) Training with LLaMA-Factory

### Step 1: Register the masked dataset

Add an entry in:

`LLaMA-Factory-main/data/dataset_info.json`

Example (replace `xxx.jsonl` with your generated file path/name):

```json
"t3": {
  "file_name": "xxx.jsonl",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "mask_input_id": "mask_input_id",
    "mask_label_id": "mask_label_id"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
}
```
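
If you prefer not to edit the file by hand, the entry can be merged in with a short script. This is a sketch that assumes `dataset_info.json` is a single JSON object keyed by dataset name; `register_dataset` is a hypothetical helper, and `xxx.jsonl` remains a placeholder for your generated file.

```python
import json
from pathlib import Path

# Same entry as shown above; "xxx.jsonl" is still a placeholder.
T3_ENTRY = {
    "file_name": "xxx.jsonl",
    "formatting": "sharegpt",
    "columns": {
        "messages": "messages",
        "mask_input_id": "mask_input_id",
        "mask_label_id": "mask_label_id",
    },
    "tags": {
        "role_tag": "role",
        "content_tag": "content",
        "user_tag": "user",
        "assistant_tag": "assistant",
        "system_tag": "system",
    },
}


def register_dataset(info_path, name, entry):
    """Add or overwrite one dataset entry, keeping all other entries."""
    info_path = Path(info_path)
    info = json.loads(info_path.read_text(encoding="utf-8"))
    info[name] = entry
    text = json.dumps(info, indent=2, ensure_ascii=False) + "\n"
    info_path.write_text(text, encoding="utf-8")
    return info


# Example call:
# register_dataset("LLaMA-Factory-main/data/dataset_info.json", "t3", T3_ENTRY)
```

Writing through `json.dumps` also guards against the trailing-comma and missing-brace mistakes that manual edits tend to introduce.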

### Step 2: Start training

Run `bash t3_train.sh`.

Training should load the dataset above and use `mask_input_id` / `mask_label_id` for the masked objective.

---

## Notes / Common Issues

* **Server not reachable:** if `before_sft.py` or `after_sft.py` fails, verify that the sglang server is running and that the endpoint configured in the script matches the one the server printed.
* **Dataset format mismatch:** ensure the masked JSONL is ShareGPT-style with a top-level `messages` list, and includes `mask_input_id` and `mask_label_id`.
* **Path issues:** keep file paths consistent across scripts and `dataset_info.json`.
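
The path issue in particular can be caught early by checking that the registered `file_name` resolves to a real file. The sketch below assumes a relative `file_name` is resolved against the directory containing `dataset_info.json`; `dataset_file_exists` is a hypothetical helper, and `t3` is the entry name used in this README.

```python
import json
from pathlib import Path


def dataset_file_exists(data_dir, name="t3"):
    """Check that the file registered under `name` exists.

    Relative file names are resolved against data_dir (an assumption);
    absolute file names are used as-is.
    """
    data_dir = Path(data_dir)
    info_text = (data_dir / "dataset_info.json").read_text(encoding="utf-8")
    file_name = json.loads(info_text)[name]["file_name"]
    return (data_dir / file_name).exists()


# Example call:
# dataset_file_exists("LLaMA-Factory-main/data")
```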

---

## Quick Checklist

* [ ] sglang server started successfully (host/port correct)
* [ ] `before_sft.py` produced confidence output
* [ ] `after_sft.py` produced confidence output
* [ ] `mask_data.py` produced masked JSONL with `messages/mask_input_id/mask_label_id`
* [ ] added `"t3": {...}` to `dataset_info.json`
* [ ] `bash t3_train.sh` runs without dataset loading errors

