# RACO

RACO is a **reward-free alignment fine-tuning method for multi-objective preference optimization**.

This repository supports fine-tuning recent **Qwen3**, **Gemma3**, and **Llama3** checkpoints, and includes:

- **Training** code under `trl/` (TRL-based).
- **Evaluation** scripts under `eval/` (local judge models + vLLM-based generation).
- End-to-end runner scripts under `script/`.

## Repo layout

- `trl/`: training code + configs (Accelerate/TRL).
- `script/`: entrypoints, including `run_and_eval.sh` and `eval_gpt5.py`.
- `eval/`: evaluation utilities and scoring scripts.
- `trl/data/`: prompt / eval JSONs (e.g. `trl/data/gpt5-eval.json`).

## Setup

This repo uses **two Python environments**:

- `trl/env/` for training
- `eval/env/` for evaluation

### Training environment

```bash
# Run from the repository root
cd trl
python3.12 -m venv env
source env/bin/activate
python -m pip install -U pip
python -m pip install -r requirements.txt
python -m pip install -e .
```

### Evaluation environment

```bash
# Run from the repository root
cd eval
python3.12 -m venv env
source env/bin/activate
python -m pip install -U pip
python -m pip install -r requirements.txt
```

## Run: train + eval sweep (`script/run_and_eval.sh`)

Open `script/run_and_eval.sh` and fill in the required paths:

- **Environments / code**
  - `TRL_DIR` (defaults to `../trl`)
  - `TRL_VENV` (e.g. `../trl/env/bin/activate`)
  - `EVAL_VENV` (e.g. `../eval/env/bin/activate`)
- **Model + datasets**
  - `BASE_MODEL`: local model path
  - `TRAIN_DATASET`: training dataset path
  - `VAL_DATASET`: validation dataset path
- **Outputs**
  - `OUTPUT_ROOT`: where fine-tuned checkpoints will be written
  - `LOG_DIR`: where logs will be written
  - `EVAL_OUTPUT_DIR`: where eval generations/scores will be written (used by the script)
- **Choose task + judges**
  - Safety (Beaver): `BEAVER_REWARD_MODEL_DIR`, `BEAVER_COST_MODEL_DIR`, `BEAVER_EVAL_PROMPTS`
  - Summarization (RedditSummary): `SUMMARY_QUALITY_MODEL_DIR`, `SUMMARY_FAITHFUL_MODEL_DIR`, `EVAL_PROMPTS`

Then run:

```bash
bash script/run_and_eval.sh
```

### Switch eval mode

`run_and_eval.sh` supports two evaluation modes (default is `beaver`):

```bash
EVAL_MODE=summary bash script/run_and_eval.sh
EVAL_MODE=beaver  bash script/run_and_eval.sh
```

### Model stop tokens (important)

When generating with vLLM, use the right stop token for your base model family:

- **Qwen3**: `<|im_end|>`
- **Gemma3**: `<end_of_turn>`
- **Llama3**: `<|eot_id|>`

The stop token is set inside `script/run_and_eval.sh` in the eval command.
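
The mapping above can be written as a small lookup, which helps avoid passing the wrong token when switching base models. This is a minimal sketch and not code from this repository: `STOP_TOKENS` and `stop_token_for` are hypothetical names, and matching on a normalized checkpoint name is an assumption about how your model paths are named.

```python
# Stop tokens per model family, as listed above.
STOP_TOKENS = {
    "qwen3": "<|im_end|>",
    "gemma3": "<end_of_turn>",
    "llama3": "<|eot_id|>",
}


def stop_token_for(model_name: str) -> str:
    """Return the stop token for a checkpoint name or path (hypothetical helper)."""
    # Drop separators so "gemma-3-4B-it" and "Llama3-8B-Instruct" both match.
    key = model_name.lower().replace("-", "").replace("_", "")
    for family, token in STOP_TOKENS.items():
        if family in key:
            return token
    raise ValueError(
        f"Unknown model family for {model_name!r}; set the stop token manually."
    )
```

The returned string is what you would place in the eval command's stop-token setting in `script/run_and_eval.sh`.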

### Sweep controls (optional)

`run_and_eval.sh` sweeps over:

- weights (positional args; defaults are in `DEFAULT_WEIGHTS`)
- learning rates (override with `LRS_OVERRIDE`)
- RACO `c` values (override with `CS_OVERRIDE`)
- clip lambdas (override with `CLIP_LAMBDAS_OVERRIDE`)

Examples:

```bash
# Pass custom weight specs (either "wq,wv" or "wq" where wv=1-wq)
bash script/run_and_eval.sh "0.8,0.2" "0.5,0.5"

# Override learning rates and c-sweep
LRS_OVERRIDE="1e-5 2e-5" CS_OVERRIDE="0.25 0.5" bash script/run_and_eval.sh

# Skip eval (only train)
SKIP_EVAL=1 bash script/run_and_eval.sh
```
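
To make the weight-spec convention concrete, the sketch below parses the two accepted forms and the space-separated override lists. It is illustrative only: `parse_weight_spec` and `parse_override` are hypothetical names, not functions from `run_and_eval.sh`, and treating the sweep as a full Cartesian product of the axes is an assumption about how the script iterates (clip lambdas are omitted for brevity).

```python
from itertools import product


def parse_weight_spec(spec: str) -> tuple[float, float]:
    """Parse "wq,wv" or "wq" (in which case wv = 1 - wq)."""
    parts = [p.strip() for p in spec.split(",")]
    if len(parts) == 1:
        wq = float(parts[0])
        return wq, 1.0 - wq
    if len(parts) == 2:
        return float(parts[0]), float(parts[1])
    raise ValueError(f"Expected 'wq' or 'wq,wv', got {spec!r}")


def parse_override(value: str) -> list[float]:
    """Parse a space-separated list such as LRS_OVERRIDE="1e-5 2e-5"."""
    return [float(v) for v in value.split()]


weights = [parse_weight_spec(s) for s in ["0.8,0.2", "0.5"]]
lrs = parse_override("1e-5 2e-5")
cs = parse_override("0.25 0.5")

# Assumption: every combination of the sweep axes is one training run.
runs = list(product(weights, lrs, cs))
```

Under that assumption, `len(runs)` gives a quick estimate of how many training jobs a given set of overrides would launch.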

## Safety alignment: LLM-as-a-judge with GPT-5 (`script/eval_gpt5.py`)

For safety alignment, you can compare two local models head-to-head using GPT-5 as the judge.

1) Activate the eval environment

```bash
source eval/env/bin/activate
```

2) Export your key

```bash
export OPENAI_API_KEY="..."
```

3) Run the evaluation

```bash
python script/eval_gpt5.py \
  --red_corner_model_path "BASE_MODEL_PATH" \
  --blue_corner_model_path "CHALLENGER_MODEL_PATH" \
  --tp 8 \
  --port 8200 \
  --server_max_model_len 3072 \
  --gen_max_tokens 1024 \
  --write_summary \
  --judge_concurrency 8
```

By default, this uses prompts from `trl/data/gpt5-eval.json` and writes results under `eval/output/`.
It prints the challenger's win rate against the base model, counting each tie as half a win (wins + 0.5·ties), and also writes `summary.json` when `--write_summary` is passed.
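
The tie-adjusted win rate can be sketched as follows. This is not the code in `script/eval_gpt5.py`: `win_rate` is a hypothetical helper, and dividing by the total number of judged comparisons is an assumption (the text above only specifies the `wins + 0.5·ties` numerator).

```python
def win_rate(wins: int, ties: int, losses: int) -> float:
    """Challenger win rate with each tie counted as half a win."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("No judged comparisons to score.")
    # Assumption: normalize by all judged comparisons.
    return (wins + 0.5 * ties) / total
```

Under this definition, a value of 0.5 means the challenger and the base model are judged evenly matched.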

## Hyperparameters (paper defaults)

### Beta (`beta`)

- **DPO-LW** and **RACO** do not include a length term in the objective, so we use a fixed `beta=0.2` across setups (matching TRL’s default DPO beta).
- **AMoPO** includes a length term and typically requires tuning. We sweep `beta` in `[0.1, 1.5]` with step `0.1`.

Best `beta` from our sweeps:

- Qwen3-4B-Base: `beta=0.8`
- Qwen3-4B-Instruct: `beta=0.4`
- gemma-3-4B-pt: `beta=0.7`
- gemma-3-4B-it: `beta=0.8`
- Llama3-8B-Instruct: `beta=0.1`
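
For reproducing the sweep, the grid can be generated as below. This is a minimal sketch with hypothetical names (`BETA_GRID`, `BEST_BETA`), not code from the repository. It builds each value from an integer index rather than repeatedly adding `0.1`, which avoids floating-point drift such as `0.30000000000000004`.

```python
# beta in [0.1, 1.5] with step 0.1, built from integer indices.
BETA_GRID = [round(0.1 * i, 1) for i in range(1, 16)]

# Best values reported above (the sweep applies to AMoPO).
BEST_BETA = {
    "Qwen3-4B-Base": 0.8,
    "Qwen3-4B-Instruct": 0.4,
    "gemma-3-4B-pt": 0.7,
    "gemma-3-4B-it": 0.8,
    "Llama3-8B-Instruct": 0.1,
}

# Sanity check: every reported value lies on the sweep grid.
off_grid = {m: b for m, b in BEST_BETA.items() if b not in BETA_GRID}
```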

### Learning rate

Following SimPO (Meng et al., 2024), we sweep learning rates in `[1e-7, 5e-5]`.
Best learning rates from our sweeps:

- Qwen3-4B-Base: AMoPO `1e-5`; RACO/DPO-LW `2.75e-5`
- Qwen3-4B-Instruct: AMoPO `1e-5`; RACO/DPO-LW `2e-5`
- gemma-3-4B-pt: AMoPO `8e-6`; RACO/DPO-LW `1.5e-5`
- gemma-3-4B-it: AMoPO `5e-6`; RACO/DPO-LW `1.5e-5`
- Llama3-8B-Instruct: AMoPO `5e-7`; RACO/DPO-LW `7e-7`

Since the main difference between RACO and DPO-LW is how conflicted gradients are resolved, we generally observe **similar learning-rate optima** for these two methods.
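
For convenience, the learning rates listed above can be collected into a lookup table. This is an illustrative sketch rather than a config shipped with the repository: `BEST_LR`, `SWEEP_RANGE`, and `best_lr` are hypothetical names, and the values are copied from the list above.

```python
SWEEP_RANGE = (1e-7, 5e-5)  # learning-rate sweep bounds stated above

# RACO and DPO-LW share one entry because their reported optima coincide.
BEST_LR = {
    "Qwen3-4B-Base": {"AMoPO": 1e-5, "RACO/DPO-LW": 2.75e-5},
    "Qwen3-4B-Instruct": {"AMoPO": 1e-5, "RACO/DPO-LW": 2e-5},
    "gemma-3-4B-pt": {"AMoPO": 8e-6, "RACO/DPO-LW": 1.5e-5},
    "gemma-3-4B-it": {"AMoPO": 5e-6, "RACO/DPO-LW": 1.5e-5},
    "Llama3-8B-Instruct": {"AMoPO": 5e-7, "RACO/DPO-LW": 7e-7},
}


def best_lr(model: str, method: str) -> float:
    """Look up the reported learning rate for a model and method."""
    if method not in ("AMoPO", "RACO", "DPO-LW"):
        raise ValueError(f"Unknown method: {method!r}")
    key = "AMoPO" if method == "AMoPO" else "RACO/DPO-LW"
    return BEST_LR[model][key]
```

For example, `best_lr("gemma-3-4B-it", "RACO")` returns `1.5e-5`, which could then be passed through `LRS_OVERRIDE`.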