# ITPO: Implicit Turn-Wise Policy Optimization Recipe for VeRL

This repository contains the training recipe for **ITPO (Implicit Turn-Wise Policy Optimization)** and related algorithms (PRIME, PTPO, GRPO), built on top of the VeRL (Volcano RL) library.


## ⚠️ Prerequisites & Integration

**Note:** This code folder (`itpo`) should be placed alongside the other recipes in VeRL's recipe folder. It is adapted from the original `prime` and `collabllm` recipes and relies on the `verl` framework.

1.  **Base Library**: `verl` must be installed and configured.
2.  **Directory Structure**: This folder (`itpo`) is intended to be placed inside the `verl/recipes/` directory.
3.  **Core Dependencies**: This recipe assumes that specific interfaces (e.g., `AsyncActorRolloutRefWorker`, specific callback hooks in `RayPPOTrainer`) exist in your version of the `verl` core library.
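
As a quick sanity check of the first two prerequisites, the following is a minimal sketch (not part of the recipe itself) that reports whether `verl` is importable and which recipe files are missing. The helper name `missing_recipe_files`, the `EXPECTED_FILES` list, and the default `recipes/itpo` path are illustrative assumptions based on this recipe's file layout; adjust them to match your checkout.

```python
from pathlib import Path

# Assumption: file names taken from this recipe's layout; edit if your copy differs.
EXPECTED_FILES = [
    "itpo_core_algos.py",
    "main_prime.py",
    "prime_ray_trainer.py",
    "prime_dp_rm.py",
    "reward_function.py",
]


def missing_recipe_files(verl_root, recipe_dir="recipes/itpo"):
    """Return the expected recipe files that are absent under verl_root/recipe_dir."""
    base = Path(verl_root) / recipe_dir
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    import importlib.util
    import sys

    # Hypothetical usage: pass the root of the verl repository as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    if importlib.util.find_spec("verl") is None:
        print("verl is not importable; install it first")
    missing = missing_recipe_files(root)
    print("missing files:", missing or "none")
```

This only checks file placement and importability. It does not verify that the interfaces named in item 3 exist in your `verl` version.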

## 📂 Installation Layout

To use this recipe, organize your file structure as follows:

```text
verl/ (Root of the verl repository)
├── verl/ (Core library)
│   ├── trainer/
│   ├── workers/
│   └── ...
├── recipes/
│   └── itpo/ (This directory)
│       ├── script/              # Demo scripts
│       ├── metrics/             # Reward metric implementations (User provided)
│       ├── itpo_core_algos.py   # Core math: RLOO, GRPO, advantage calculation, ITPO, and Norm-ITPO
│       ├── main_prime.py        # Entry point
│       ├── prime_ray_trainer.py # Custom Ray Trainer for PRIME/ITPO
│       ├── prime_dp_rm.py       # DataParallel Reward Model implementation
│       ├── reward_function.py   # Async reward & LLM Judge logic
│       ├── medical_agent_loop.py
│       ├── collabllm_agent_loop.py
│       └── ...
```


The other folder, `Verl`, contains example files that should replace the corresponding files in the vanilla VeRL library.
