# On-Policy-Cold-Start (OPC-SFT)

## Overview

This repository contains code for **Off-Policy Token Clipped Supervised Fine-Tuning (OPC-SFT)**, a modification of standard **Supervised Fine-Tuning (SFT)** designed for **Large Language Models (LLMs)**.

Standard SFT is widely used to adapt pre-trained LLMs to specialized domains, providing a critical **cold-start** for downstream tasks such as reinforcement learning or task-specific evaluation. However, conventional SFT often suffers from **overfitting to a small set of expert demonstrations**, which can lead to **catastrophic forgetting** of pre-trained knowledge and unstable updates, especially for tokens that the base model initially assigns low probability.
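
To see why low-probability tokens drive large updates, recall that for softmax cross-entropy the gradient with respect to the logits is `softmax(z) - onehot(target)`. The sketch below is purely illustrative: the function name `ce_logit_grad` and the toy logits are hypothetical and do not come from this repository. It compares the gradient on the target logit for a likely and an unlikely target token.

```python
import math

def ce_logit_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to each logit."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # d(-log p_target) / d z_k = p_k - 1[k == target]
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

logits = [2.0, 0.0, -2.0]                  # toy next-token logits (hypothetical)
grad_likely = ce_logit_grad(logits, 0)     # target the model already favors
grad_unlikely = ce_logit_grad(logits, 2)   # target the model considers unlikely

print(abs(grad_likely[0]), abs(grad_unlikely[2]))
```

The magnitude on the target logit is `1 - p_target`, so it approaches 1 as the base model's probability for the demonstrated token approaches 0.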

**OPC-SFT addresses this challenge** by identifying such “off-policy” tokens and **clipping their gradient updates** during fine-tuning. This strategy limits disruptive parameter changes, stabilizes the early stages of training, and preserves the model’s prior knowledge. In essence, OPC-SFT creates a **more on-policy-like training dynamic** within a supervised fine-tuning framework.
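
One way to picture the clipping is as a PPO-style clipped surrogate: treat every demonstration token as having advantage 1, compute the probability ratio between the current model and a frozen reference, and stop the gradient once the ratio exceeds the clip range. The sketch below is an assumption made for illustration, not this repository's implementation. The function `clipped_token_objective`, its arguments, and the toy probabilities are all hypothetical.

```python
import math

def clipped_token_objective(logp_new, logp_old, eps_clip=0.5):
    """PPO-style clipped surrogate with advantage fixed to 1 for every token.

    Returns (mean_loss, active), where active[i] is False when token i sits on
    the clipped (constant) branch and therefore would receive no gradient.
    """
    losses, active = [], []
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)  # pi_new(token) / pi_old(token)
        clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
        # With a positive advantage, min() only ever selects the clipped
        # branch when the ratio is above 1 + eps_clip.
        losses.append(-min(ratio, clipped))
        active.append(ratio <= 1.0 + eps_clip)
    return sum(losses) / len(losses), active

# Hypothetical example: the second token's probability has doubled.
logp_old = [math.log(0.50), math.log(0.01)]
logp_new = [math.log(0.55), math.log(0.02)]
loss, active = clipped_token_objective(logp_new, logp_old, eps_clip=0.5)
print(loss, active)
```

Under these assumptions, a token whose probability has already grown by more than a factor of `1 + eps_clip` stops contributing to the update, which bounds how far any single token can pull the model in one optimization phase.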

We evaluate OPC-SFT on agentic benchmarks such as **ALFWorld** and **ScienceWorld**. The results show that OPC-SFT **reduces forgetting on out-of-distribution tasks** and **improves downstream task performance** when further training or evaluation follows. Latent-space analysis also shows that OPC-SFT induces **less representational drift** than standard SFT, which supports its robustness in cold-start scenarios.

This repository includes scripts, configuration files, and example datasets to reproduce our experiments and adapt OPC-SFT to new tasks, with a focus on **ease of use, flexibility, and reproducibility**.

------

## Backbone

The **backbone** refers to the pre-trained LLM being fine-tuned. In our experiments, we use the following models as backbones:

- `Llama-3.2-3B-Instruct`
- `Qwen2.5-1.5B-Instruct`
- `Qwen2.5-7B-Instruct`

These models provide strong natural language understanding and reasoning capabilities. OPC-SFT fine-tunes the selected backbone on task-specific datasets such as **ALFWorld** or **ScienceWorld**, improving performance on multi-turn tasks while preserving general knowledge acquired during pre-training.

------

## Quick Start

### 1. Installation

```bash
# 1. Create and activate Conda environment
conda create -n opc-sft python=3.10 -y
conda activate opc-sft

# 2. Install dependencies
pip install -r requirements.txt

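# 3. Install OpenRLHF (choose ONE of the options below)
# Option A: release from PyPI
pip install openrlhf
# Option B: release with the vLLM extra (quoted so the shell does not expand the brackets)
pip install "openrlhf[vllm]"
# Option C: release with the latest supported vLLM
pip install "openrlhf[vllm_latest]"

# Option D: latest development version from GitHub
pip install git+https://github.com/OpenRLHF/OpenRLHF.git

# Option E: editable install from a local clone, for development
git clone https://github.com/OpenRLHF/OpenRLHF.git
cd OpenRLHF
pip install -e .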
```

> **Tip:** If you plan to run reinforcement learning experiments or perform evaluations later, it’s recommended to install the ALFWorld and ScienceWorld packages in advance:

```bash
pip install git+https://github.com/alfworld/alfworld.git
pip install git+https://github.com/allenai/ScienceWorld.git
```

------

### 2. Running the Training Scripts

After installation, start training with the provided example scripts. They run **on-policy PPO-SFT** training, the name the scripts and flags use for OPC-SFT, on each benchmark.

#### a) Training on ALFWorld

```bash
# Adjust the paths inside the script according to your setup:
# HOME, PYTHONPATH, MODEL, PROMPT_DATA

bash examples/script/train_pposft_alfworld.sh
```

**Notes:**

- Make sure to adjust the following paths **inside the script** according to your system:
  - `HOME`: base path for this project
  - `PYTHONPATH`: should include your project directory
  - `MODEL`: path to the pre-trained model
  - `PROMPT_DATA`: path to your prompt data JSON
- This script runs **ALFWorld** tasks.
- If you plan to run evaluation or reinforcement learning later, you can install the required environments (`alfworld` and `ScienceWorld`) beforehand.

#### b) Training on ScienceWorld

```bash
bash examples/script/train_pposft_sciworld.sh
```

- This script runs **on-policy PPO-SFT** on the **ScienceWorld** benchmark.
- Adjust paths inside the script as needed.
- Suitable for tasks requiring **scientific reasoning and problem-solving**.

> **Tip:** Always verify that the environment packages (`alfworld` and `ScienceWorld`) are installed before running these scripts.
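
A quick way to run this check from Python is to ask the import system whether each package can be found. This is a minimal sketch: the helper `missing_packages` is hypothetical, and the import names `alfworld` and `scienceworld` are assumed to be the top-level modules installed by the two packages.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed import names for the two environment packages.
REQUIRED = ["alfworld", "scienceworld"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing environment packages:", ", ".join(missing))
    else:
        print("All environment packages found.")
```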

------

### 3. Key Training Parameters

Here are some important parameters in our **PPO-SFT** training scripts:

- `--eps_clip 0.5`
  - This is the **core parameter of our algorithm** controlling the clipping ratio in PPO-style updates.
  - You can experiment with different values to see how it affects training stability and performance.
- `--use_muti_turn`
  - Determines whether the prompts are **multi-turn conversations**.
  - For our agentic datasets (ALFWorld and ScienceWorld), multi-turn is **required**.
- `--no_ppo`
  - Disables PPO, running **standard supervised fine-tuning (SFT)** instead.
  - When using `--no_ppo`, you can optionally enable:
    - `--dft_baseline`: adds a baseline to reduce variance in standard SFT
    - `--neftune_alpha 0.01`: sets the NEFTune noise scale (alpha), i.e. the magnitude of the random noise added to the token embeddings during training
- `--use_wandb {wandb_token}`
  - Enables logging to **Weights & Biases (WandB)** for monitoring training.
  - **Important:** You must first run `wandb login` to authenticate your account before using this option.

> **Tip:** For agentic tasks, always keep `--use_muti_turn` enabled. Adjust `--eps_clip` carefully if PPO is used. If `--no_ppo` is enabled, you can explore `--dft_baseline` and `--neftune_alpha` to improve SFT performance.
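
For reference, NEFTune perturbs the token embeddings with uniform noise scaled by `alpha / sqrt(seq_len * hidden_dim)`. The sketch below illustrates that scaling in plain Python. The function names and the toy embeddings are hypothetical, and it does not show how `--neftune_alpha` is wired into this repository's training loop.

```python
import math
import random

def neftune_scale(seq_len, hidden_dim, alpha):
    """Noise magnitude used by NEFTune: alpha / sqrt(seq_len * hidden_dim)."""
    return alpha / math.sqrt(seq_len * hidden_dim)

def add_neftune_noise(embeddings, alpha, rng):
    """Add uniform noise in [-scale, scale] to a [seq_len][hidden_dim] list."""
    scale = neftune_scale(len(embeddings), len(embeddings[0]), alpha)
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]

# Toy input: 4 tokens with 8-dimensional embeddings (hypothetical values).
rng = random.Random(0)
embeddings = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
noisy = add_neftune_noise(embeddings, alpha=0.01, rng=rng)
print(neftune_scale(4, 8, 0.01))
```

Because the scale shrinks with sequence length and hidden size, the same `alpha` produces a smaller per-element perturbation on longer inputs and on wider models.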

------

## Acknowledgements

We would like to express our sincere gratitude to the developers and contributors of **OpenRLHF** for providing the training framework, utilities, and example scripts that made this project possible.

We also thank the creators and maintainers of the following datasets for providing high-quality training data for our experiments:

- **[LEVI Project SFT Data](https://huggingface.co/LEVI-Project/sft-data)** — used as supervised fine-tuning data for **ALFWorld**.
- **[AgentTraj-L Dataset](https://huggingface.co/datasets/AgentGym/AgentTraj-L/tree/main)** — used as supervised fine-tuning data for **ScienceWorld**.

Their open-source contributions have greatly facilitated research in reinforcement learning, large language model fine-tuning, and embodied agent benchmarks.

