# Anchored Policy Optimization (APO): Code (double-blind)

This repository contains the implementation, training scripts, and evaluation utilities for the method described in the associated manuscript, and is intended for reproducing the paper's experiments. Author information and external links are omitted for double-blind review.


## Key points

- Purpose: code and scripts to reproduce experiments for the method under review. This README is intentionally minimal for double-blind review.

## Layout (essential)
- `run_scripts/` — convenience shell wrappers to launch training.
- `scripts/` — evaluation and helper scripts.
- `data/` — expected dataset locations.
- `outputs/` — experiment outputs and logs.
- `verl/` — core implementation (trainer, models, utils).

## Quick setup
Preferred: run the provided one-step bootstrap script from the repository root:

```bash
bash APO/uv_verl.sh
```

Alternative manual steps:

```bash
# create and activate environment
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
# install package in editable mode
pip install --no-deps -e .
```

Ensure the project root is on `PYTHONPATH` (scripts may already do this):

```bash
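# quote the value so paths with spaces survive; skip the leading colon when PYTHONPATH is unset
export PYTHONPATH="${PYTHONPATH:+${PYTHONPATH}:}$(pwd)"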
```
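
Running the export repeatedly in the same shell appends the root each time. A minimal sketch of an idempotent variant follows, assuming it is run from the repository root; the variable name `repo_root` is illustrative and is not defined by the repository's scripts.

```bash
# repo_root is an illustrative name, not something the repository defines.
repo_root="$(pwd)"

# Append the root only if it is not already one of the colon-separated entries.
case ":${PYTHONPATH:-}:" in
    *":${repo_root}:"*) ;;  # already present, nothing to do
    *) export PYTHONPATH="${PYTHONPATH:+${PYTHONPATH}:}${repo_root}" ;;
esac

echo "PYTHONPATH=${PYTHONPATH}"
```

Wrapping the value in colons before matching avoids false positives when the root is a prefix of another entry.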

## Run (example)
Run the main training entrypoint with per-run overrides or use a wrapper in `run_scripts/`.

Short template:

```bash
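# list-valued overrides are double-quoted so the shell does not strip the
# inner quotes or treat the brackets as a glob pattern
python -m verl.trainer.main_ppo \
    data.train_files="['data/train.parquet']" \
    data.val_files="['data/val.parquet']" \
    trainer.project_name=project \
    trainer.experiment_name=exp \
    trainer.total_epochs=10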
```
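
The list-valued overrides are the easiest part of this command to get wrong, so it can help to inspect the exact arguments before launching. The sketch below assembles the same command and prints it one argument per line. `DRY_RUN` is a hypothetical convenience variable introduced here for illustration; it is not a flag of the trainer or of the wrappers in `run_scripts/`.

```bash
# Build each list override as one double-quoted string so the brackets and
# inner single quotes reach the Python entrypoint unchanged.
train_files="['data/train.parquet']"
val_files="['data/val.parquet']"

# DRY_RUN is a hypothetical switch for this sketch; it defaults to printing only.
DRY_RUN="${DRY_RUN:-1}"

# Store the command in the positional parameters (portable across POSIX shells).
set -- python -m verl.trainer.main_ppo \
    "data.train_files=${train_files}" \
    "data.val_files=${val_files}" \
    trainer.project_name=project \
    trainer.experiment_name=exp \
    trainer.total_epochs=10

if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$@"   # show one argument per line for inspection
else
    "$@"                 # launch training with exactly these arguments
fi
```

Set `DRY_RUN=0` in the environment to launch the run instead of printing it.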

Or use a provided wrapper (example):

```bash
bash run_scripts/run_apo_7B.sh
```

## Evaluation (example)

```bash
bash scripts/eval_model.sh outputs/<run_dir>
```
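
When several runs have accumulated, a small helper can pick the most recent one to evaluate. This is a sketch under the assumption that each run writes to its own subdirectory of `outputs/`; `latest_run_dir` is a hypothetical helper, not part of the repository, and it does not handle directory names containing newlines.

```bash
# latest_run_dir is a hypothetical helper, not shipped with the repository.
# It prints the most recently modified subdirectory of the given base directory.
latest_run_dir() {
    base="${1:-outputs}"
    # List subdirectories newest-first and keep the first entry.
    newest="$(ls -1dt "$base"/*/ 2>/dev/null | head -n 1)"
    [ -n "$newest" ] || return 1
    printf '%s\n' "${newest%/}"   # drop the trailing slash
}

if run_dir="$(latest_run_dir outputs)"; then
    echo "latest run: ${run_dir}"
    # bash scripts/eval_model.sh "${run_dir}"   # uncomment to evaluate it
else
    echo "no run directories found under outputs/" >&2
fi
```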

## License
Use of this code is governed by the `LICENSE` file at the project root.

