# **[Anonymous Code] PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning**

## 📊 **Data**

We adapt multiple multimodel reasoning benchmarks to construct our training and evaluation datasets.

### **Training Data**

- **Training**: We adapt [TIGER-Lab/ViRL39K](https://huggingface.co/datasets/TIGER-Lab/ViRL39K) for training. The processed dataset can be found at: [Anonymous]().
- Validation (optional): We use the testset from [MMK12](https://huggingface.co/datasets/FanqingM/MMK12) for validation during training. **Note that this is solely for monitoring, we do not pick checkpoints based on this.** The processed dataset can be found [Anonymous]().

### **Evaluation Data**
We adapted 8 different multimodal reasoning benchmarks to evaluate **PAPO**, which are further identify two groups, including `General Multimodal Reasoning` and `Vision-Dependent Multimodal Reasoning`:

[Anonymous]()

All results in the paper are average accurarcy @ 8 (repeating 8 times), with a temperature set to 1.0.

## 🚀 **Quick Start**

### **Environment Setup**

#### **Option 1: All-in-one Installation Script**
```bash
conda create -n papo python=3.10
conda activate papo

cd PAPO
bash scripts/install.sh
```

#### **Option 2: Using pip**
```bash
pip install -e .
```

### **Training**

The main training pipeline is adopted from [EasyR1](https://github.com/hiyouga/EasyR1). We support training with different configurations for both `Qwen2.5-VL 3B` and `7B` models:
- **Qwen2.5-VL 3B:** We typically use 2 `80G H100` GPUs
- **Qwen2.5-VL 7B:** We typically use 4 `80G H100` GPUs

#### **GRPO Baseline**
```bash
# 3B model
cd PAPO
bash examples/qwen2_5_vl_3b_grpo.sh

# 7B model  
cd PAPO
bash examples/qwen2_5_vl_7b_grpo.sh
```

#### PAPO (γ = 0.01)
```bash
# 3B model
cd PAPO
bash examples/qwen2_5_vl_3b_papo.sh

# 7B model  
cd PAPO
bash examples/qwen2_5_vl_7b_papo.sh
```

#### PAPO_H (γ = 0.02)
```bash
# 3B model
cd PAPO
bash examples/qwen2_5_vl_3b_papo_high.sh

# 7B model (with double entropy loss)
cd PAPO
bash examples/qwen2_5_vl_7b_papo_high.sh
```

#### PAPO + No Reference KL
```bash
# 3B model (with double entropy loss)
cd PAPO
bash examples/qwen2_5_vl_3b_papo_no_kl_ref.sh

# 7B model (with double entropy loss)
cd PAPO
bash examples/qwen2_5_vl_7b_papo_no_kl_ref.sh
```

### **Pretrained Checkpoints**

A collection of 7B/3B pretrained checkpoints on ViRL39K can be downloaded from [Anonymous](). The checkpoints follows Qwen2.5-VL Huggingface format, which can be inferenced as drop-in replacement to https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct. 

### **Performance Evaluation**

To run model inference and evaluation, we integrate the evaluation submodule located at `PAPO/PAPO-Eval`.
Detailed instructions for running inference and evaluation can be found in [Anonymous]().
```bash
# Navigate to PAPO evaluation submodule
cd PAPO-Eval

# Data preprocessing
bash papo_eval/preprocess/preprocess.sh

# Run model inference
bash papo_eval/run_infer.sh

# Run model evaluation
bash papo_eval/run_eval.sh
```

## 🥰 Acknowledgements

We thank the [EasyR1](https://github.com/hiyouga/EasyR1) team for providing the foundational codebase that we adapted to implement PAPO. Our implementation builds upon their efficient RLVR framework and extends it with perception-aware optimization methodologies. We also acknowledge the open-source community for providing the datasets and evaluation benchmarks that made this research possible.

## 📝 Citation

```bibtex
Anonymous
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

<div align="center">

</div>