<div align="center">

# **Spotlight on Token Perception for Multimodal Reinforcement Learning**

</div>

This repository contains the official implementation for **VPPO (Visually-Perceptive Policy Optimization)**, a novel policy gradient algorithm designed to resolve the foundational misalignment between uniform learning signals and non-uniform reasoning processes in multimodal RL. By introducing a hierarchical mechanism guided by *visual dependency*, VPPO steers the learning signal toward robust, visually-grounded reasoning paths over spurious shortcuts, achieving new state-of-the-art performance.

## 🌟 Key Highlights

- **🧠 Novel Framework:** Introduces **VPPO**, a new RL algorithm that explicitly integrates token-level visual perception into the policy update, resolving the core signal dilution problem in multimodal RLVR.
- **🏆 State-of-the-Art Performance:** Achieves commanding leads over strong open-source baselines on a suite of 8 challenging multimodal reasoning benchmarks, validated on both **Qwen2.5-VL 7B and 32B** scales.
- **🛡️ Enhanced Stability:** VPPO's hierarchical signal modulation acts as a potent implicit regularizer, overcoming the catastrophic **late-stage performance collapse** that plagues standard RL methods.
- **🔌 Plug-and-Play:** VPPO is designed as a modular enhancement that can be seamlessly integrated into mainstream RLVR algorithms like GRPO and DAPO.

## 📖 Methodology

Standard RLVR frameworks suffer from a critical flaw: a single, coarse reward is broadcast indiscriminately to all tokens in a sequence. This **uniform learning signal** is misaligned with the **non-uniform nature of visually-grounded reasoning**, where only a sparse subset of tokens is truly pivotal.

**VPPO** resolves this by introducing a dual mechanism guided by a new metric, **token-level visual dependency**:
1.  **Macro-level (Trajectory Advantage Shaping):** The advantage of an entire trajectory is re-weighted by its average visual dependency. This prioritizes learning from robust, perception-driven reasoning paths over brittle, shortcut-based ones.
2.  **Micro-level (Token Gradient Filtering):** Policy updates are focused *exclusively* on the top-k% most visually-dependent tokens. This directly counters signal dilution and reduces gradient variance, leading to more stable and effective training.
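
The two mechanisms above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: it assumes per-token visual-dependency scores are already available (this README does not define how they are computed), it uses the raw mean dependency as the trajectory weight (the paper may normalize it differently), and the names `shape_advantage`, `topk_token_mask`, `per_token_advantages`, and `k_percent` are hypothetical.

```python
import math


def shape_advantage(advantage, token_deps):
    """Macro level: scale a trajectory's advantage by its mean visual dependency.

    Assumption: the weight is the raw mean of the per-token scores.
    """
    mean_dep = sum(token_deps) / len(token_deps)
    return advantage * mean_dep


def topk_token_mask(token_deps, k_percent):
    """Micro level: 1 for the top-k% most visually dependent tokens, else 0."""
    n_keep = max(1, math.ceil(len(token_deps) * k_percent / 100))
    # Rank token positions by dependency score, highest first.
    ranked = sorted(range(len(token_deps)), key=lambda i: token_deps[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(token_deps))]


def per_token_advantages(advantage, token_deps, k_percent):
    """Combine both levels: shaped advantage on kept tokens, zero elsewhere."""
    shaped = shape_advantage(advantage, token_deps)
    mask = topk_token_mask(token_deps, k_percent)
    return [shaped * m for m in mask]
```

In this sketch, tokens outside the top-k% receive a zero advantage, so they contribute nothing to the policy gradient, while the remaining tokens share a trajectory-level advantage scaled by how visually dependent the whole response is.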

For a detailed explanation of the theory and implementation, please refer to our paper.

## 📊 Data

We adapt multiple multimodal reasoning benchmarks to construct our training and evaluation datasets.

### **Training Data**

- **Training**: We use [TIGER-Lab/ViRL39K](https://huggingface.co/datasets/TIGER-Lab/ViRL39K) for training our models.
- **Validation**: We use the test set from [MMK12](https://huggingface.co/datasets/FanqingM/MMK12) for validation during training.

### **Evaluation Data**

We evaluate **VPPO** on 8 diverse multimodal reasoning benchmarks. For `MathVerse`, `MathVision`, and `DynaMath_Sample`, we filter out instances with free-form answers to ensure verifiable, exact-match evaluation and to avoid reliance on LLM-as-a-judge.
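
As a rough illustration of this filtering step, the sketch below drops free-form instances and scores the rest by exact match. The field name `question_type`, the value `free_form`, and the helpers `filter_verifiable` and `exact_match` are hypothetical; each benchmark uses its own schema, and the actual preprocessing may differ.

```python
def filter_verifiable(examples, type_key="question_type", free_form_value="free_form"):
    """Keep only instances whose answer can be checked by exact match.

    Assumption: each example is a dict that marks free-form questions
    under `type_key`; real benchmark schemas vary.
    """
    return [ex for ex in examples if ex.get(type_key) != free_form_value]


def exact_match(prediction, reference):
    """Compare answers after trimming whitespace and ignoring case."""
    return prediction.strip().lower() == reference.strip().lower()
```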

| Benchmark         | Hugging Face Link                                            | Focus Domain         |
| ----------------- | ------------------------------------------------------------ | -------------------- |
| Geo3k             | [`hiyouga/geometry3k`](https://huggingface.co/datasets/hiyouga/geometry3k) | Geometric Reasoning  |
| We-Math           | [`We-Math/We-Math`](https://huggingface.co/datasets/We-Math/We-Math) | Mathematical Reasoning |
| MMK12             | [`FanqingM/MMK12`](https://huggingface.co/datasets/FanqingM/MMK12) | Mathematical Reasoning |
| MathVerse         | [`AI4Math/MathVerse`](https://huggingface.co/datasets/AI4Math/MathVerse) | Mathematical Reasoning |
| MathVision        | [`MathLLMs/MathVision`](https://huggingface.co/datasets/MathLLMs/MathVision) | Mathematical Reasoning   |
| DynaMath          | [`DynaMath/DynaMath_Sample`](https://huggingface.co/datasets/DynaMath/DynaMath_Sample) | Mathematical Reasoning |
| LogicVista        | [`lscpku/LogicVista`](https://huggingface.co/datasets/lscpku/LogicVista) | Logical Reasoning    |
| MMMU-Pro          | [`MMMU/MMMU_Pro`](https://huggingface.co/datasets/MMMU/MMMU_Pro) | Multi-discipline     |

All results in the paper are reported as **average accuracy @ 8**, with an inference temperature of **1.0**.
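
A small sketch of how such a metric could be computed, assuming "average accuracy @ 8" means the fraction of correct answers among 8 sampled responses per question, averaged over all questions. The function name `average_accuracy_at_k` and the input layout are illustrative, not taken from the evaluation scripts.

```python
def average_accuracy_at_k(correct_per_question):
    """Mean over questions of the per-question fraction of correct samples.

    `correct_per_question` holds one list of booleans per question,
    with one entry per sampled response (8 entries for accuracy @ 8).
    """
    per_question = [sum(samples) / len(samples) for samples in correct_per_question]
    return sum(per_question) / len(per_question)
```

For example, a question answered correctly in all 8 samples and another answered correctly in 4 of 8 would together score 0.75 under this definition.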

## 🚀 Quick Start

### **Prerequisites**
- Python 3.10+
- PyTorch 2.6+
- CUDA 12.4+

### **Environment Setup**

```bash
# Create and activate conda environment
conda create -n vppo python=3.10
conda activate vppo

# Clone and install the repository
# git clone ...
cd VPPO
pip install -e .
```

### **Training**

The main training pipeline is adopted from [EasyR1](https://github.com/hiyouga/EasyR1). We support training with different configurations for both the `Qwen2.5-VL-7B` and `Qwen2.5-VL-32B` models, using the following hardware:
- **Qwen2.5-VL-7B:** We use 8 x H800 (80G) GPUs.
- **Qwen2.5-VL-32B:** We use 32 x H800 (80G) GPUs.

#### **Train VPPO-7B**
```bash
# For the 7B model
cd VPPO
bash examples/configs/main.sh
```

### **Performance Evaluation**

Our evaluation leverages the framework and scripts provided in [PAPO-Eval](https://github.com/xhguo7/PAPO-Eval).
