# SynthRL

A reinforcement learning project that uses synthetic data for training vision-language models.

## Environment Setup

To set up the development environment, run the following commands:

```bash
# Create the conda environment
conda create -y -n synthrl python=3.10
conda activate synthrl

# Enter the EasyR1 directory
cd EasyR1

# Install PyTorch and dependencies
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install the project in editable mode
pip install -e .

# Install additional dependencies
pip install google-genai
pip install "math-verify[antlr4_9_3]"
```
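
After installation, a quick check can confirm that the pinned PyTorch build is active and can see a GPU. The sketch below is not part of the repository: `check_torch` is a hypothetical helper name, and it assumes `python` resolves to the interpreter of the activated `synthrl` environment.

```bash
# Hypothetical helper (not part of the repository): print the installed
# PyTorch version, the CUDA version it was built against, and whether a
# GPU is visible. Assumes the synthrl environment is active.
check_torch() {
  python -c '
import torch
print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
'
}
```

Run `check_torch` in the same shell. If the reported version does not start with 2.5.1, or CUDA is reported as unavailable, the conda install step likely needs to be revisited before training.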

## Configuration

- Replace every occurrence of `PROJECT_DIR` in the project with the path to your working directory
- Set up API keys:
  - `GOOGLE_API_KEY` for Gemini API access
  - `QWEN` API key (OpenAI-compatible) for the Qwen verifier
  - Note: `SILICON_FLOW_API_KEY` in the project is the same as the `QWEN` API key. Any OpenAI-compatible API key that gives access to the Qwen model will work.
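
The configuration steps above can be sketched as follows. This is a minimal sketch under two assumptions: that the scripts read the keys from environment variables (if they are hard-coded in the scripts, edit them there instead), and that `PROJECT_DIR` appears as a literal string in the files. `find_project_dir_refs` is a hypothetical helper name, and the key values are placeholders.

```bash
# Hypothetical helper: list every line under a directory that still
# mentions PROJECT_DIR, so each occurrence can be edited by hand.
find_project_dir_refs() {
  grep -rn "PROJECT_DIR" "$1"
}

# Placeholder values: substitute your own keys.
export GOOGLE_API_KEY="your-gemini-api-key"
# OpenAI-compatible key used for the Qwen verifier.
export SILICON_FLOW_API_KEY="your-openai-compatible-key"

# Warn early if either key is empty.
[ -n "$GOOGLE_API_KEY" ] || echo "GOOGLE_API_KEY is not set" >&2
[ -n "$SILICON_FLOW_API_KEY" ] || echo "SILICON_FLOW_API_KEY is not set" >&2
```

Running `find_project_dir_refs .` from the repository root prints each matching file and line number. The helper only reports occurrences and does not rewrite files, since the exact form of each `PROJECT_DIR` reference is not specified here.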

## Data Synthesis

```bash
./scripts/run_evolve_verifiable.sh
```

This command synthesizes data using `Gemini-2.5-flash-preview-04-17` as the synthesizer and `Qwen/Qwen2.5-VL-7B-Instruct` as the verifier.

## RL Training

Example training command (A-MMK12 8K, 8 episodes, checkpoints saved every 16 steps, on 8 NVIDIA GPUs):

```bash
bash scripts/run_qwen2_5_vl_7b_geo.sh QWEN2.5-VERIFY-K12-8K-QWEN k12-8k-combine-qwen-v10-nokl ./data/K12-V10-QWEN-8K-THRES8_combined 16 8
```
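
Since the example assumes 8 NVIDIA GPUs, it can help to confirm how many the driver reports before launching. The sketch below is not a script in this repository: `count_gpus` is a hypothetical helper that relies only on `nvidia-smi -L`, which prints one line per GPU.

```bash
# Hypothetical helper: print the number of GPUs reported by the NVIDIA
# driver, or 0 if nvidia-smi is not installed.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Each GPU is listed on a line starting with "GPU <index>:".
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

if [ "$(count_gpus)" -lt 8 ]; then
  echo "Fewer than 8 GPUs visible; the example command assumes 8." >&2
fi
```

The check only warns and does not block the run. Whether the training script can run on fewer GPUs depends on its configuration, which is not covered here.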

## Evaluation

Evaluate all checkpoints:

```bash
bash ./scripts/run_eval_vlm_all.sh QWEN2.5-VERIFY-K12-8K-QWEN/qwen2_5_vl_7b_k12-8k-combine-qwen-v10-nokl
```

Evaluation logs will be saved under:

```bash
./evaluation/logs_vlm
```
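
To browse the saved logs, a small helper can list the JSON files in that directory. This is a minimal sketch: `list_eval_logs` is a hypothetical name, and nothing is assumed about the contents of the files beyond the `.json` extension.

```bash
# Hypothetical helper: list all JSON evaluation logs, sorted by path.
# Takes an optional directory argument, defaulting to the log directory.
list_eval_logs() {
  log_dir="${1:-./evaluation/logs_vlm}"
  find "$log_dir" -type f -name '*.json' | sort
}
```

For example, `python -m json.tool "$(list_eval_logs | head -n 1)"` pretty-prints the first log using Python's built-in JSON tool.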

## Note

Because supplementary materials are limited to 100 MB, we cannot include all of our data and logs here. The evaluation logs (.json) for the A-MMK12 8K version are included in `./evaluation/logs_vlm`. We will make all data, checkpoints, logs, and Bradley-Terry battle records publicly available.
