# Pipeline

Minimal instructions for rerunning experiments.

## Setup

```bash
uv sync
cp .env.example .env
```

Set `HF_NAME`, `HF_TOKEN`, and `WANDB_API_KEY` in `.env`.
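
A quick preflight check can catch a missing variable before a long job starts. The sketch below is not part of the repo: `check_env` is a hypothetical helper that only assumes the three variable names listed above.

```bash
# Hypothetical helper (not in the repo): report every required
# variable that is unset or empty in the current shell.
check_env() {
  missing=0
  for name in HF_NAME HF_TOKEN WANDB_API_KEY; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Exit status 0 if everything is set, 1 otherwise.
  return "$missing"
}
```

If `.env` contains plain `KEY=value` lines (an assumption about its format), it can be loaded into the current shell with `set -a; . ./.env; set +a` and then checked with `check_env`.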

## Regenerate Self-Judge DPO Data

```bash
export HF_NAME=your-hf-username

CUDA_VISIBLE_DEVICES=1,2 uv run python score_judge.py \
  --dataset "$HF_NAME/chainsum_generations" \
  --output "$HF_NAME/chainsum_sjudge" \
  --max_model_len 8192 \
  --tensor_parallel_size 2

uv run python make_data.py dpo_pairs \
  --src "$HF_NAME/chainsum_sjudge" \
  --out "$HF_NAME/chainsum_sjudge_dpo"
```

## Run Full Strict Experiment

```bash
HF_NAME=your-hf-username \
bash run.sh 1,2 42
```

`run.sh` regenerates the entropy and self-judge datasets, trains the DPO and RRHF variants, and evaluates all four strict-condition runs.

## Key Hyperparameters

For DPO, `run.sh` uses `beta=5`, `learning_rate=5e-5`, `num_train_epochs=1`, `per_device_train_batch_size=1`, and `gradient_accumulation_steps=2`.
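
These settings imply a small effective batch size. The arithmetic below is a rough sketch: it assumes training runs data-parallel across every GPU in the list (here `1,2`, as in the commands above), which this README does not confirm. If only one process trains, set `num_gpus=1`.

```bash
# Values taken from the DPO settings above.
per_device_train_batch_size=1
gradient_accumulation_steps=2

# Assumption: one data-parallel worker per listed GPU.
gpus="1,2"
num_gpus=$(printf '%s\n' "$gpus" | tr ',' '\n' | wc -l)

# Examples contributing to each optimizer step.
effective_batch_size=$((per_device_train_batch_size * gradient_accumulation_steps * num_gpus))
echo "effective batch size: $effective_batch_size"
```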

Evaluation uses a holdout set of 200 samples drawn from the original dataset; these samples are excluded from the training data.
