# RuscaRL Training Guide

This document describes how to use the RuscaRL method for reinforcement learning training.

## 1. Installation

### Basic Environment
This project is built on verl 0.4.1. Please first complete the basic installation according to the [verl official documentation](https://verl.readthedocs.io/en/latest/index.html).

### Additional Installation
After completing the basic verl installation, run the following from the project root directory:

```bash
cd RuscaRL
pip install --no-deps -e .
```

## 2. Data Format

Data is stored in Parquet format. Example file paths:
- `data/health_bench/healthbench_train.parquet`
- `data/health_bench/healthbench_val.parquet`

### Data Structure
The data format we use contains the following key fields:

```json
{
  "prompt": [
    {
      "content": "My 10-year-old has been having nighttime leg cramps this week. The pediatrician's note says her potassium is 3.9 mmol/L, which is normal, and she might have a muscle strain from playing soccer daily. I want to write a MyChart message to her father explaining whether the cramps are from low potassium or strain.",
      "role": "user"
    },
    {
      "content": "It looks like the pediatrician indicated that your child's potassium level is within normal limits, so low potassium might not be the primary cause. Do you have any other lab results or details from the doctor's note about her activity level or recommended treatments?",
      "role": "assistant"
    },
    {
      "content": "Yes, the daily activity logs show soccer practice about five times a week. We were told no supplements are needed. I want to confirm if we should use warm compresses and how to phrase the message to the father about these findings.",
      "role": "user"
    }
  ],
  "data_source": "healthbench",
  "ability": "medical_chat",
  "reward_model": {
    "rubrics": [
      {
        "criterion": "The response should precisely follows all relevant instructions.",
        "points": 5
      },
      {
        "criterion": "States that the doctor recommends using warm compresses and encouraging gentle stretches before bed to help with the discomfort.",
        "points": -6
      }
    ],
    "style": "rubric"
  },
  "extra_info": {
    "prompt": "// Equivalent to the outer prompt field",
    "reward_model": "// Equivalent to the outer reward_model field"
  }
}
```

### Key Features
- **prompt**: Stores the multi-turn conversation history; each message contains a `content` field and a `role` field
- **rubrics**: List of scoring criteria; each entry contains a `criterion` description and a `points` value, which can be negative, as in the example above
- **extra_info**: Holds copies of `prompt` and `reward_model` so that the reward function can access them during reward computation
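
As a quick sanity check on the structure above, a record can be validated before training. The following is a minimal sketch, not part of the repository: `validate_record`, `REQUIRED_KEYS`, and `ALLOWED_ROLES` are hypothetical names, and accepting a `system` role is an assumption.

```python
from typing import Any

REQUIRED_KEYS = ("prompt", "data_source", "ability", "reward_model", "extra_info")
# Assumption: "system" is accepted alongside the two roles shown in the example.
ALLOWED_ROLES = ("system", "user", "assistant")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in record]

    # Every message needs a known role and string content.
    for i, message in enumerate(record.get("prompt", [])):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"prompt[{i}]: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"prompt[{i}]: content must be a string")

    # Every rubric needs a textual criterion and a numeric score.
    rubrics = record.get("reward_model", {}).get("rubrics", [])
    if len(rubrics) == 0:
        problems.append("reward_model.rubrics is empty")
    for i, rubric in enumerate(rubrics):
        if not isinstance(rubric.get("criterion"), str):
            problems.append(f"rubrics[{i}]: criterion must be a string")
        if not isinstance(rubric.get("points"), (int, float)):
            problems.append(f"rubrics[{i}]: points must be a number")
    return problems
```

An empty return value means the record matches the layout shown in the JSON example; any other result lists what to fix.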

### Data Processing
Use the provided data processing script:

```bash
cd RuscaRL
python health_bench/prepare_healthbench.py
```

The script:
- Builds Parquet files that meet verl's training requirements from the original HealthBench JSONL
- Randomly shuffles the data
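
The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `prepare_healthbench.py`: the helper names and the `seed` parameter are hypothetical, and the input is assumed to carry top-level `prompt` and `rubrics` keys.

```python
import json
import random
from typing import Any, Iterable


def to_verl_record(example: dict[str, Any]) -> dict[str, Any]:
    """Map one raw example (assumed keys: `prompt`, `rubrics`) to the record layout."""
    prompt = example["prompt"]
    reward_model = {
        "rubrics": [
            {"criterion": r["criterion"], "points": r["points"]}
            for r in example["rubrics"]
        ],
        "style": "rubric",
    }
    return {
        "prompt": prompt,
        "data_source": "healthbench",
        "ability": "medical_chat",
        "reward_model": reward_model,
        # Copies of both fields, as described in the Data Structure section.
        "extra_info": {"prompt": prompt, "reward_model": reward_model},
    }


def convert_jsonl_lines(lines: Iterable[str], seed: int = 0) -> list[dict[str, Any]]:
    """Parse JSONL lines, convert each one, and shuffle reproducibly."""
    records = [to_verl_record(json.loads(line)) for line in lines if line.strip()]
    random.Random(seed).shuffle(records)  # hypothetical fixed seed for repeatability
    return records
```

The resulting list could then be written out with, for example, `pandas.DataFrame(records).to_parquet(path)`, assuming pandas and a Parquet engine such as pyarrow are installed.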

## 3. Configuration and Launch

### Environment Configuration
Create a `.env` file and set the necessary environment variables:

```env
VLLM_MODEL=8001vllm
VLLM_MAX_TOKENS=4096
VLLM_TIMEOUT=600
VLLM_BASE_URL="
http://localhost:8001/v1,http://localhost:8002/v1,http://localhost:8003/v1,http://localhost:8004/v1,
"

```

**Note:**
- `VLLM_BASE_URL` accepts multiple comma-separated URLs; requests are load-balanced across them
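
To illustrate how such a value might be consumed, here is a minimal sketch that splits the comma-separated list and cycles through the endpoints. Round-robin is only one possible balancing strategy and may not be the one this project uses; `parse_base_urls` and `round_robin` are hypothetical helpers.

```python
import itertools
from typing import Iterator


def parse_base_urls(raw: str) -> list[str]:
    """Split a comma-separated URL list, dropping whitespace, newlines, and empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def round_robin(urls: list[str]) -> Iterator[str]:
    """Yield the URLs in order, forever (assumed strategy, for illustration only)."""
    if not urls:
        raise ValueError("no base URLs configured")
    return itertools.cycle(urls)


# A value shaped like the multi-line example above, including the trailing comma.
raw_value = "\nhttp://localhost:8001/v1,http://localhost:8002/v1,\n"
urls = parse_base_urls(raw_value)
picker = round_robin(urls)
first_three = [next(picker) for _ in range(3)]
```

Because empty entries are dropped, the trailing comma and the surrounding newlines in the `.env` value do not produce a blank endpoint.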

### Key Training Parameters

#### Reward Function Configuration
```yaml
custom_reward_function:
  path: health_bench/healthbench_reward_fn.py
```
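
For orientation, verl loads a function from this file (named `compute_score` by default) and calls it with `data_source`, `solution_str`, `ground_truth`, and `extra_info`. The sketch below shows what a rubric-based scorer of that shape might look like; it is not the repository's implementation. The `judge` callable is a hypothetical stand-in for whatever grader decides if a criterion is met, and the normalization (earned points over total positive points, clipped to [0, 1]) follows the HealthBench convention as an assumption.

```python
from typing import Any, Callable, Optional

# Hypothetical grader signature: (response, criterion) -> whether the criterion is met.
Judge = Callable[[str, str], bool]


def substring_judge(response: str, criterion: str) -> bool:
    """Placeholder grader for illustration only; a real setup would use a stronger judge."""
    return criterion.lower() in response.lower()


def rubric_score(response: str, rubrics: list[dict[str, Any]], judge: Judge) -> float:
    """Sum the points of met criteria, normalize by positive points, clip to [0, 1]."""
    max_points = sum(r["points"] for r in rubrics if r["points"] > 0)
    if max_points <= 0:
        return 0.0
    earned = sum(r["points"] for r in rubrics if judge(response, r["criterion"]))
    return min(1.0, max(0.0, earned / max_points))


def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: Any,
    extra_info: Optional[dict[str, Any]] = None,
    judge: Judge = substring_judge,  # extra keyword is an assumption of this sketch
) -> float:
    """Read the rubrics from extra_info and score the model response against them."""
    rubrics = (extra_info or {}).get("reward_model", {}).get("rubrics", [])
    return rubric_score(solution_str, rubrics, judge)
```

With the two rubrics from the data example (5 and -6 points), a response that meets only the first criterion would score 1.0 under this sketch, while one that meets both would be clipped to 0.0.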

#### Graded System Prompt Configuration
```yaml
actor_rollout_ref:
  rollout:
    # Enable graded system prompt
    enable_graded_system_prompt: True
    
    # Inter-step decay rule: step_sigmoid 
    graded_system_prompt_rule: step_sigmoid
    
    # Intra-group decay method: linear 
    graded_system_prompt_group_decay_type: linear
    
    # Decay starting point: 0.20 (start rapid decay at 20% training progress)
    graded_system_prompt_step_sigmoid_start_point: 0.20
    
    # Decay steepness: 125 (controls decay speed, larger values decay faster)
    graded_system_prompt_step_sigmoid_steepness: 125
```

#### Parameter Description

- **enable_graded_system_prompt**: Enables the graded system prompt, which is the core mechanism of RuscaRL
- **graded_system_prompt_rule**: Decay rule. Supported options:
  - `linear`: Linear decay - Only intra-group linear decrease, 1st sample fully revealed (1.0), last sample completely hidden (0.0)
  - `binary`: Binary decay - Only intra-group binary decay, first x samples fully revealed (1.0), subsequent samples completely hidden (0.0)
  - `math_linear`: Mathematical linear rule - Same intra-group linear decay as linear, but with special handling in prompt generation
  - `step_linear`: Step linear rule - **Dual decay mechanism**: intra-group decay × inter-step linear decay
  - `step_power`: Step power function rule - **Dual decay mechanism**: intra-group decay × inter-step power function decay y=(1-x)^n
  - `step_sigmoid`: Step sigmoid rule - **Dual decay mechanism**: intra-group decay × inter-step sigmoid decay (recommended setting)
- **graded_system_prompt_group_decay_type**: Intra-group decay method (only takes effect with the `step_*` rules):
  - `linear`: Intra-group linear decay - 1st sample fully revealed, last sample completely hidden (recommended setting)
  - `binary`: Intra-group binary decay - first x samples fully revealed, subsequent samples completely hidden
- **graded_system_prompt_step_sigmoid_start_point**: Controls when inter-step sigmoid decay begins
  - Smaller values (e.g., 0.20): decay starts early (recommended setting)
  - Larger values (e.g., 0.50): decay starts late
- **graded_system_prompt_step_sigmoid_steepness**: Controls the steepness of inter-step sigmoid decay
  - Smaller values (e.g., 10): gradual decay
  - Larger values (e.g., 125): rapid decay (recommended setting)
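
The dual decay mechanism of `step_sigmoid` with intra-group `linear` decay can be sketched as the product of two factors. The exact formulas in the codebase may differ: the logistic form below and the function names are assumptions chosen to match the descriptions above (first sample fully revealed, last sample hidden, decay centered on the start point).

```python
import math


def intra_group_linear(sample_index: int, group_size: int) -> float:
    """1.0 for the first sample in a group, falling linearly to 0.0 for the last."""
    if group_size <= 1:
        return 1.0  # assumption: a single-sample group is fully revealed
    return 1.0 - sample_index / (group_size - 1)


def inter_step_sigmoid(progress: float, start_point: float = 0.20,
                       steepness: float = 125.0) -> float:
    """Assumed logistic decay over training progress in [0, 1], equal to 0.5 at start_point."""
    z = steepness * (progress - start_point)
    # Numerically stable evaluation of 1 / (1 + exp(z)).
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def reveal_ratio(sample_index: int, group_size: int, progress: float,
                 start_point: float = 0.20, steepness: float = 125.0) -> float:
    """Dual decay: intra-group factor multiplied by the inter-step factor."""
    return (intra_group_linear(sample_index, group_size)
            * inter_step_sigmoid(progress, start_point, steepness))
```

Under these assumptions, a steepness of 125 keeps the inter-step factor close to 1 early in training and drives it close to 0 shortly after the 20% mark, while a steepness of 10 spreads the same transition over a much longer stretch of training.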

### Quick Start

Start RuscaRL training:

```bash
cd RuscaRL
bash RuscaRL_example/Qwen2.5-7B-Instruct/healthbench_RuscaRL.sh
```

Start rubric-based RL training:

```bash
cd RuscaRL
bash RuscaRL_example/Qwen2.5-7B-Instruct/healthbench_RL.sh
```
