

<h1 align="center" style="margin-top: -50px;">✨ Agentic Reinforced Policy Optimization</h1>




## Table of Contents


- [Overview](#-overview)
- [Quick Start](#-quick-start)
  - [Cold-Start SFT Stage](#-cold-start-sft-stage)
    - [Environment Setup](#1-environment-setup)
    - [Fine-Tuning Model](#2-fine-tuning-model)
  - [ARPO Stage](#-arpo-stage)
    - [Environment Setup](#1-environment-setup-1)
    - [ARPO RL Training](#3-arpo-rl-training)
  - [ARPO Evaluation](#-arpo-evaluation)
    - [Setup vLLM Inference Environment](#1-setup-vllm-inference-environment)
    - [Setup Evaluation Environment](#2-setup-evaluation-environment)
    - [Configure and Run Evaluation](#3-configure-and-run-evaluation)
    - [Calculate Metrics](#4-calculate-metrics)
- [Citation](#-citation)




## 💡 Overview

We propose **Agentic Reinforced Policy Optimization (ARPO)**, **an agentic RL algorithm tailored for training multi-turn LLM-based agent**. The core principle of ARPO is to encourage the policy model to adaptively branch sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors.


- In figure (left), The initial tokens generated by the LLM after receiving **each round of tool-call feedback consistently exhibit a high entropy**. This indicates that external tool-call significantly **introduces uncertainty into the LLM’s reasoning process**.

- In the figure (right), we validate ARPO's performance **across 13 datasets**. Notably, Qwen3-14B with ARPO excelled in Pass@5, **achieving 61.2% on GAIA and 24.0% on HLE**, while requiring only about **half the tool calls** compared to GRPO during training.






## 🏃 Quick Start

Reproducing ARPO requires three steps: cold start fine-tuning (optional), ARPO training, and evaluation. Below, we will provide a detailed explanation.

## ❄️ Cold-Start SFT Stage (Optional)

This stage is meant to help you reproduce our experimental results. If your want to RL from scratch, you can skip this stage.

### 1. Environment Setup

In this step, we will describe how to perform a cold start for the SFT stage using the LLaMA Factory repository. First, set up the environment as follows:

```bash
# First download the ARPO repository (which includes LLaMA-Factory), then
cd ARPO/LLaMA-Factory

# Create a new conda environment
conda create -n sft python=3.10
conda activate sft

# Install dependencies
pip install -r requirements.txt
```

### 2. Fine-Tuning Model


1. Download your SFT dataset and place it in `LLaMA-Factory-main/data/final_sft_edition9.json`. Define the dataset in `dataset_info.json`.

2. Configure Training

Update `LLaMA-Factory/arpo_train_sft/yaml` with the following content:

```yaml
### model
model_name_or_path: <your_model_path>
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: ../examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset_dir: dataset_info
dataset: <your_dataset>
template: qwen
cutoff_len: 15000
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: <your_output_dir>
logging_steps: 10
save_steps: 2000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 7.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

```
Also, update the output directory in arpo_train_sft/sft_train.sh:

```bash
# Output directory
OUTPUT_DIR="<your_output_dir>"
```

After completing the information, you can fine-tune the model using the following command:

```python
bash arpo_train_sft/sft_train.sh
```

---

## 🔥 ARPO Stage

In this step, we will load the cold-start data for GRPO training. We reference the [ReCall](https://github.com/Agent-RL/ReCall) and [VERL](https://github.com/volcengine/verl) frameworks for RL training.


### 1. Environment Setup

 you can install our additional environment as follow: 

```bash
#create env
conda create -n arpo python==3.10
conda activate arpo

# install torch & flash-atten
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# install RL basic env
cd arpo

# This is our RL env freeze file. You can install it as a supplement or use it for checking.
pip install -r requirements.txt

```
---

### 2. Preparation

### 2.1 Data Preparation

In our paper, we offer two type of train & validation datasets to verify the effectiveness of ARPO:

1. **Reasoning and Knowledge Dataset**: This dataset is used to test the benchmarks listed in Table 1.
   - **train_10k.parquet**: Contains 10K samples for mathematical and knowledge reasoning.
   - **test.parquet**: Comprises 300 test samples from 8 datasets, including AIME24, AIME25, MATH500, GSM8k, HotpotQA, 2Wiki, Misque, and Bamboogle.

2. **Deep Search Dataset**: This dataset is used to test the benchmarks listed in Table 2.
   - **hard_search.parquet**: Contains 1K samples, including 800 samples from simpledeepsearch and 200 samples from webdancer.
   - **gaia_test.parquet/hle_test.parquet**: Contains test samples from GAIA and Humanity Last Exam (HLE).


### 2.2 API Key Configuration

Our search api tool utilizes [Bright Data](https://brightdata.com/) (A third-party Bing API, without the retirement risk of official Bing API). Before starting the training, please replace the API key and zone in the following files: `ARPO/scripts/config/ppo_trainer_dr.yaml` and `ARPO/scripts/config/ppo_trainer.yaml`.

Additionally, please also replace the API key and zone in the following file: `/verl_arpo_entropy/verl/workers/rollout/tools/config_example.yaml`. Below is the instruction on how to do this:

<details>
<summary>🔍 Click here! Watch the details of tool API configuration YAML</summary>

```yaml
tools:
  # General tool configuration
  call_limit: 3  # Maximum number of tool calls allowed per sample
  max_workers: 64  # Maximum number of threads for concurrent tool execution
  timeout: 120  # Tool execution timeout (seconds)
  retry_count: 3  # Number of retry attempts for tool execution failures
  verbose_logging: true  # Enable detailed logging
  fail_on_error: false  # Throw an exception if tool loading fails
  
  # Tool instance definitions
  tool_instances:
    python:  
      class_path: verl.workers.rollout.tools.python_tool.PythonTool  # Tool class path
      params:  # Tool-specific parameters
        conda_path: /path/to/conda
        conda_env: verl
    
    search:
      class_path: verl.workers.rollout.tools.search_tool.BingSearchTool
      params:
        api_key: <your_API_key>  # Replace with your Bright Data API key
        zone: <your_zone>  # Replace with your Bright Data zone
        max_results: 10
        result_length: 1000
        location: cn
```

</details>

Make sure to replace `<your_API_key>` and `<your_zone>` with your actual Bright Data API key and zone. This configuration ensures that the search tool is properly set up to perform searches during the training process. If you have any questions or need further assistance, feel free to ask!

---

### 3. ARPO RL Training

We have open-sourced a series of ARPO scripts located in the `/ARPO/scripts/` directory, which includes configurations for 7B, 8B, and 14B models. Below is an example of how to set up and run training for training ARPO. Make sure to replace placeholders like `<your_path_to_ARPO>`, `<your_model_path>`, and `<your_checkpoint_save_dir>` with your actual paths.


<details>
<summary>🔍 Click here! Watch the details of train bash</summary>
  
```bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
PARENT_DIR="$(dirname "$SCRIPT_DIR")"
cd "$PARENT_DIR"
echo "Switched to parent directory: $PARENT_DIR"


# ============================ Environment Setup ============================
# Set basic environment variables
export PYTHONUNBUFFERED=1            
export HYDRA_FULL_ERROR=1           
export VLLM_ATTENTION_BACKEND=XFORMERS 
export VERL_LOGGING_LEVEL=DEBUG
export MKL_SERVICE_FORCE_INTEL=1    
export MKL_THREADING_LAYER=GNU       
export RAY_memory_usage_threshold=0.8  
export RAY_memory_monitor_refresh_ms=0 


# Set Python path
export PYTHONPATH="<your_path_to_ARPO>"/verl_arpo_entropy:$PYTHONPATH

# ============================ Basic Configuration ============================
# Experiment name and project
PROJECT_NAME="reasoning_tasks" # Modify experiment group
EXPERIMENT_NAME="ARPO_global_16_init_8_beam_2_random_0_arpo_0.2_entropy" # Modify experiment name

# Configuration file path
CONFIG_PATH="<your_path_to_ARPO>/scripts/config" # Modify the absolute path of the config folder, relative path is not recommended
CONFIG_NAME="ppo_trainer.yaml"

# Distributed training settings
NNODES=1                            
N_GPUS_PER_NODE=8                   

# ============================ Data Configuration ============================
# Data parameters
PROMPT_KEY="prompt"                 # Prompt field name
TRAIN_BATCH_SIZE=128                # Training batch size
PPO_MINI_BATCH_SIZE=16              # PPO mini-batch size
MAX_PROMPT_LENGTH=1536              # Maximum prompt length
MAX_RESPONSE_LENGTH=4096            # Maximum response length

# Data file paths
TRAIN_FILES="<your_path_to_ARPO>/rl_datasets/train.parquet" # Modify training data path
VALID_FILES="<your_path_to_ARPO>/rl_datasets/valid.parquet" # Modify validation data path

# ============================ Model Configuration ============================
# Actor model path
ACTOR_MODEL_PATH="<your_model_path>" # Modify training model path

# ============================ Rollout Configuration ==========================
# Rollout settings
ROLLOUT_NAME="vllm"                 # Use vllm engine
ROLLOUT_MODE="sync_with_tool"       # Synchronous mode with tool support
ROLLOUT_N=16                         # Number of responses generated per sample
INITIAL_ROLLOUTS=8                 # Initial rollout number
BEAM_SIZE=2                        # Beam size
BRANCH_PROBABILITY=0.5             # Branch probability
Entropy_weight=0.2
# ============================ Rollout Tools Configuration ==========================
SEARCH_CACHE_PATH="<your_path_to_ARPO>/search_cache/search_cache.json" # Modify

# ============================ Reward Model Configuration ==========================
# Reward model settings
REWARD_MANAGER="naive"              # Reward manager type
CUSTOM_REWARD_FUNCTION_PATH="<your_path_to_ARPO>/verl_arpo_entropy/verl/utils/reward_score/deep_research.py" # Modify reward function path
CUSTOM_REWARD_FUNCTION_NAME="compute_score"

# ============================ Training Configuration ============================
# Training parameters
TOTAL_EPOCHS=2                      # Total training epochs
SAVE_FREQ=5                        # Save frequency
TEST_FREQ=5                        # Test frequency

# ============================ Path Configuration ============================
# Save path
SAVE_PATH="<your_checkpoint_save_dir>/${EXPERIMENT_NAME}" # Modify save path
ROLLOUT_SAVE_PATH="${SAVE_PATH}/rollout"

# ============================ WandB Configuration ============================
# WandB settings
WANDB_API_KEY="<your_wandb_key>" # Modify your wandb key
SEARCH_CLASS_PATH="verl.workers.agent.tools.search_tool.BingSearchTool"
# ============================ Preparation ============================
# Login to WandB (if API key is provided)
if [ "$WANDB_API_KEY" != "" ]; then
    wandb login --relogin $WANDB_API_KEY
    export WANDB_DIR=${SAVE_PATH}
fi

# Create save directory
if [ ! -d "$SAVE_PATH" ]; then
    mkdir -p $SAVE_PATH
fi

# Create rollout save directory
if [ ! -d "$ROLLOUT_SAVE_PATH" ]; then
    mkdir -p $ROLLOUT_SAVE_PATH
fi

# ============================ Start Training ============================
python3 -m verl.trainer.main_ppo \
    --config-path=$CONFIG_PATH \
    --config-name=$CONFIG_NAME \
    algorithm.adv_estimator=grpo \
    algorithm.kl_ctrl.kl_coef=0.0 \
    data.train_files=${TRAIN_FILES} \
    data.val_files=${VALID_FILES} \
    data.prompt_key=${PROMPT_KEY} \
    data.train_batch_size=${TRAIN_BATCH_SIZE} \
    data.max_prompt_length=${MAX_PROMPT_LENGTH} \
    data.max_response_length=${MAX_RESPONSE_LENGTH} \
    actor_rollout_ref.model.path=${ACTOR_MODEL_PATH} \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((2*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=${ROLLOUT_NAME} \
    actor_rollout_ref.rollout.mode=${ROLLOUT_MODE} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.n=${ROLLOUT_N} \
    actor_rollout_ref.rollout.initial_rollouts=${INITIAL_ROLLOUTS} \
    actor_rollout_ref.rollout.beam_size=${BEAM_SIZE} \
    actor_rollout_ref.rollout.branch_probability=${BRANCH_PROBABILITY} \
    actor_rollout_ref.rollout.entropy_weight=${Entropy_weight} \
    actor_rollout_ref.rollout.tools.tool_instances.search.params.cache_file=${SEARCH_CACHE_PATH} \
    actor_rollout_ref.rollout.tools.tool_instances.search.class_path=${SEARCH_CLASS_PATH} \
    actor_rollout_ref.rollout.multi_turn.enable=${ENABLE_MULTI_TURN} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    reward_model.reward_manager=${REWARD_MANAGER} \
    custom_reward_function.path=${CUSTOM_REWARD_FUNCTION_PATH} \
    custom_reward_function.name=${CUSTOM_REWARD_FUNCTION_NAME} \
    trainer.critic_warmup=0 \
    trainer.logger="[console, wandb]" \
    trainer.project_name=${PROJECT_NAME} \
    trainer.experiment_name=${EXPERIMENT_NAME} \
    trainer.n_gpus_per_node=${N_GPUS_PER_NODE} \
    trainer.nnodes=${NNODES} \
    trainer.save_freq=${SAVE_FREQ} \
    trainer.test_freq=${TEST_FREQ} \
    trainer.total_epochs=${TOTAL_EPOCHS} \
    trainer.default_local_dir=${SAVE_PATH} \
    trainer.val_before_train=False \
    trainer.rollout_data_dir=${ROLLOUT_SAVE_PATH} \
    hydra.run.dir=${SAVE_PATH}/outputs 2>&1 | tee ${SAVE_PATH}/run.log
```

</details>


You can then run the following script to start training:

```bash
cd ./ARPO/scripts/
bash ARPO_7B_Reasoning_1node.sh
```

For the trained RL checkpoint, you can follow the code below to convert the weights to Hugging Face format：

```bash
bash ./ARPO/merge_ckpt/convert_checkpoint_from_verl_to_hf_qwen3.sh
```




---

## ✅ ARPO Evaluation

If you have already trained a model, you can refer to the following process for TIR capability evaluation.
This guide walks you through setting up two separate environments:
- One for **vLLM inference service** (`vllm_env`)
- One for **evaluation pipeline** (`evaluation`)

### 1. Setup vLLM Inference Environment

```bash
# Step into the vllm_scripts directory
cd evaluation/vllm_scripts

# Create a dedicated conda environment for vLLM
conda create -n vllm_env python=3.10
conda activate vllm_env

# Install dependencies (edit as needed)
pip install -r requirements.txt
```

Edit the following launch scripts with your own model paths and names:

In `vllm_launch_reasoning_model_cuda4-7.sh`:

```bash
MODEL_PATH="<path/to/your/reasoning_model_checkpoint>"
MODEL_NAME="your_model_name"
```

For summarization models (choose one):
```bash
MODEL_PATH="<path/to/your/summarization_model_checkpoint>"
MODEL_NAME="your_summarization_model_name"
```

Launch the vLLM services:
```bash
# Start the reasoning model
bash vllm_launch_reasoning_model_cuda4-7.sh

# Start the summarization model (choose one)
bash vllm_launch_summarize_model_cuda0-3_<your_model>.sh
```

---

### 2. Setup Evaluation Environment

```bash
# Create a separate environment for evaluation
conda create -n evaluation python=3.10
conda activate evaluation

# Install required packages
cd evaluation
pip install -r requirements.txt
```

---

### 3. Configure and Run Evaluation

Edit the `infer_local_sds.sh` script with the following values:

```bash
# Activate your Conda environment manually if 'conda' is not available in shell
source < /path/to/your/conda >/bin/activate
conda activate < your env name >

# Datasets to evaluate — uncomment the ones you want to include:
# Options: aime24, aime25, math500, gsm8k, math, webwalker, hotpotqa, 2wiki, bamboogle, musique, hle, gaia, SimpleQA, xbench
data_names=(
    "hle"
    "gaia"
)

# Required parameters to update:
EXP_NAME="<your_exp_name>"                   # Name of this experiment run
MODEL_PATH="<your_model_path>"               # Path to the reasoning model
OUTPUT_PATH="<your_output_path>"             # Directory to save outputs
CONDA_PATH="<your_conda_path>"               # Path to your Conda installation
CONDA_ENV="<your_env_name>"                  # Name of your Conda environment
BING_API_KEY="<your_bing_search_api_key>"    # Bing Search API key
BING_ZONE="<your_bing_zone>"                 # Bing API zone
SUMM_MODEL_PATH="<your_summarization_model_path>"  # Path to summarization model checkpoints
```
> For Bing API usage, please refer to [Bright Data](https://brightdata.com/).

Run the evaluation:
```bash
bash evaluation/infer_local_sds.sh
```

> 🔸 For Chinese datasets like `xbench`, use `infer_local_sds_cn.sh` instead.


### 4. Calculate Metrics

After generating inference results, you can use a large model like **Qwen2.5-72B-Instruct** to evaluate them with more powerful understanding capabilities.

First, use the vLLM environment to start the evaluation model:

```bash
bash evaluation/deploy_qwen2.5_72B_instruct.sh
```

In that script, make sure to update the `vllm serve` command with your own model path:

```bash
# Activate your Conda environment manually if 'conda' is not available in shell
source < /path/to/your/conda >/bin/activate
conda activate < your env name >

vllm serve <your_model_path> \
  --served-model-name Qwen2.5-72B-Instruct \
  --max-model-len 32768 \
  --tensor_parallel_size 4 \
  --gpu-memory-utilization 0.75 \
  --quantization gptq \
  --port 8001
```

Before running the evaluation script, update the following line in `evaluate_passk.sh` to specify the output directory:

```bash
OUTPUT_DIR="<your_result_directory>"
```

Then, run the evaluation script to calculate metrics:

```bash
bash evaluation/evaluate_passk.sh
```

