## Getting Started
### Environment Preparation
Install the Python dependencies by running the following commands:
```shell
conda create -n perception-r1 python=3.10 -y

conda activate perception-r1
pip3 install -r requirements.txt
```

### Data Preparation
Extract the `data/Perception-R1-data.tar.gz` tarball and place its contents under the `data` directory.
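
If you prefer to do the extraction from Python, a minimal sketch could look like the following. The helper name `extract_dataset` is hypothetical and not part of this repository, and the sketch assumes the archive's internal layout already matches what the training scripts expect under `data`; list the extracted names and adjust if it does not.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive: str = "data/Perception-R1-data.tar.gz",
                    dest: str = "data") -> list[str]:
    """Extract a gzip-compressed tarball into `dest` and return its member names.

    Hypothetical helper: the default paths follow the README, but the
    archive's internal directory layout is an assumption.
    """
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()      # record what the archive contains
        tar.extractall(dest_path)   # unpack everything under `dest`
    return names
```

Calling `extract_dataset()` from the repository root unpacks the tarball into `data` and returns the extracted paths, which is a quick way to check where the files landed.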

### Training
The entire training process can be completed in approximately 16 hours using 16 NVIDIA A800-80G GPUs, with 8 GPUs allocated for vLLM serving and the remaining 8 for RL training.

First, start a vLLM server:
```shell
bash examples/vllm_qwen_serve.sh
```
Then set the `VLLM_SERVER_BASE_URL`, `VLLM_SERVER_API_KEY` and `VLLM_MODEL_NAME` fields in `examples/run_qwen2_5_vl_7b_geo3k_visual_swanlab.sh` to match your server configuration.
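
Before launching training, it can help to confirm that the server responds and serves the model name you configured. The sketch below is an illustration, not part of this repository: the function names are hypothetical, and it assumes the base URL points at vLLM's OpenAI-compatible API root (for example `http://localhost:8000/v1`), so that the model list lives at `<base URL>/models`.

```python
import json
import urllib.request


def models_endpoint(base_url: str) -> str:
    """Build the model-listing URL from a base URL such as http://host:8000/v1.

    Assumption: the base URL already includes the /v1 prefix.
    """
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, api_key: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by the server's model-listing endpoint."""
    request = urllib.request.Request(
        models_endpoint(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    # Raises urllib.error.URLError if the server is not reachable yet.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload["data"]]
```

Pass the same values you set for `VLLM_SERVER_BASE_URL` and `VLLM_SERVER_API_KEY`; the value of `VLLM_MODEL_NAME` should then appear in the returned list.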

Once the vLLM server is initialized, run the following script to start training:
```shell
bash examples/run_qwen2_5_vl_7b_geo3k_visual_swanlab.sh
```

## Major Modifications
Our modifications for incorporating the visual perception reward are primarily implemented in the following modules:
- `verl/workers/reward/custom.py`: Contains the core logic for assigning rewards to model-generated responses, including the implementation of the N-gram penalty reward.
- `verl/utils/reward_score/math_with_visual.py`: Implements the visual perception reward, which evaluates the consistency between visual annotations and generated responses.
- `verl/utils/reward_score/boxed_math_verify.py`: Implements the accuracy and format rewards used during training to encourage correct answers and a structured output format.
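
To give a feel for two of the reward components named above, here is a minimal, self-contained sketch. It is not the code in these modules: the function names, the word-level tokenization, the n-gram size, and the penalty scale are all assumptions chosen for illustration, and the format check assumes (based only on the file name) that a well-formed response wraps its final answer in `\boxed{...}`.

```python
from collections import Counter
from typing import Optional


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word-level n-grams that repeat an earlier n-gram."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeats / len(ngrams)


def ngram_penalty(text: str, n: int = 4, max_penalty: float = 1.0) -> float:
    """Non-positive penalty that grows with repetition (hypothetical scaling)."""
    return -max_penalty * repeated_ngram_fraction(text, n)


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    begin = start + len(marker)
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces: treat as missing


def format_reward(text: str) -> float:
    """1.0 if a balanced \\boxed{...} answer is present, else 0.0 (assumed rule)."""
    return 1.0 if extract_boxed(text) is not None else 0.0
```

In this sketch a response with no repeated n-grams receives no penalty, while a response that loops on the same phrase is pushed toward `-max_penalty`. How the actual modules weight and combine these terms with the visual perception reward is defined in the files listed above.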
