# VisualThinker-R1-Zero: The First R1-Zero “Aha Moment” on Just a 2B Non-SFT Model

VisualThinker-R1-Zero is a replication of [DeepSeek-R1-Zero](https://arxiv.org/abs/2501.12948) in visual reasoning. We are **the first** to successfully observe **the emergent “aha moment”** and **increased response length** in **visual reasoning** on just a **2B non-SFT model**.

> Training dynamics of VisualThinker-R1-Zero, trained from Qwen-VL-2B without SFT or reward models. This is the first time an aha moment and increasing response length have been observed in a multimodal model.

## 🔮 Highlights
1. We are the **first to successfully produce the emergent “aha moment” and increased response length** for multimodal reasoning on just a **non-SFT 2B model**.
2. We show that **vision-centric** tasks can also benefit from improved reasoning capabilities.

Similar to DeepSeek R1, self-reflection behavior also emerges during our RL training on vision-centric reasoning tasks. The model exhibits an emergent ability to rethink and correct its mistakes:

```
. . .
Therefore, dark brown wooden bed with white blanket is not above the doorway.
But wait! I can think of something else.
Maybe it's just higher than above the doorway, but slightly lower than above the doorway.
. . .
```
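One rough way to track how often this kind of self-reflection appears in sampled responses is to search for reflection markers. The sketch below is illustrative only: the phrase list and the `count_reflections` helper are assumptions made for this example and are not part of this repository.

```python
import re

# Hypothetical marker phrases (an assumption); extend them for your own outputs.
REFLECTION_MARKERS = ("but wait", "let me rethink", "on second thought")

_PATTERN = re.compile(
    "|".join(re.escape(marker) for marker in REFLECTION_MARKERS),
    re.IGNORECASE,
)


def count_reflections(response: str) -> int:
    """Return the number of reflection markers found in a model response."""
    return len(_PATTERN.findall(response))


sample = (
    "Therefore, dark brown wooden bed with white blanket is not above the doorway.\n"
    "But wait! I can think of something else."
)
print(count_reflections(sample))  # 1
```

Counting markers is only a crude proxy; a response can revise its reasoning without using any of these phrases.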

## 💻 Hardware Requirements

\* *estimated*

| Method                   | Bits |   2B   |
| ------------------------ | ---- | ------ |
| GRPO Full Fine-Tuning    |  AMP | 4*80GB |

## 🧱 Setup

```bash
bash setup.sh
```

## 🤗 Prepare Dataset

```bash
cd src/data/SAT
bash prepare_dataset.sh
```

## 🏋️ Training

### GRPO Training
To reproduce the multimodal aha moment, run the following commands to train the non-SFT model with GRPO on the SAT dataset:
```bash
cd src/open-r1-multimodal
bash run_grpo_SAT.sh # Adjust open-r1-multimodal/configs/zero3.yaml or zero2.yaml to match your hardware
```
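For readers new to GRPO: it samples a group of responses for each prompt and scores every response relative to the rest of its group, so no learned value model is needed. Below is a minimal sketch of that group-relative advantage. The rewards are made up and `group_relative_advantages` is a hypothetical helper; the training script's actual implementation may differ in details such as the epsilon or the standard-deviation estimator.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean get a positive advantage and are reinforced, while those below the mean are discouraged. If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.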

### SFT Training
To obtain an SFT model for comparison, run the following commands to fine-tune the non-SFT model on SAT:
```bash
cd src/open-r1-multimodal
bash run_sft.sh # Adjust open-r1-multimodal/configs/zero3.yaml or zero2.yaml to match your hardware
```

## 📈 Evaluation

### CVBench Evaluation
We provide the following commands to reproduce our evaluation results on CVBench. First, change to the evaluation directory:
```bash
cd src/eval
```

To evaluate the Base + GRPO (VisualThinker-R1-Zero) model:
```bash
python evaluate_Qwen2_VL_CVBench-base.py --model_path <path_to_your_model> \
    --bs 8 \
    --use_reasoning_prompt
```
To evaluate the Base model:
```bash
python evaluate_Qwen2_VL_CVBench-base.py --model_path <path_to_your_model> \
    --bs 8 \
    --no-use_reasoning_prompt
```
To evaluate the Instruct + GRPO model:
```bash
python evaluate_Qwen2_VL_CVBench.py --model_path <path_to_your_model> \
    --bs 8 \
    --use_reasoning_prompt
```
To evaluate the Instruct model:
```bash
python evaluate_Qwen2_VL_CVBench.py --model_path <path_to_your_model> \
    --bs 8 \
    --no-use_reasoning_prompt
```
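The four commands above differ only in the script name and the reasoning-prompt flag, so they can be generated programmatically when running a sweep. The following is a small sketch rather than a script shipped with this repository: `build_eval_command` is a hypothetical helper, and `path/to/model` is a placeholder to replace with the checkpoint for each setting.

```python
import shlex

# Script names and flags are taken from the commands above.
SCRIPTS = {
    "base": "evaluate_Qwen2_VL_CVBench-base.py",
    "instruct": "evaluate_Qwen2_VL_CVBench.py",
}


def build_eval_command(variant, model_path, use_reasoning_prompt, batch_size=8):
    """Assemble the argument list for one CVBench evaluation run."""
    flag = "--use_reasoning_prompt" if use_reasoning_prompt else "--no-use_reasoning_prompt"
    return [
        "python", SCRIPTS[variant],
        "--model_path", model_path,
        "--bs", str(batch_size),
        flag,
    ]


# "path/to/model" is a placeholder; each setting needs its own checkpoint.
for variant in ("base", "instruct"):
    for use_reasoning in (True, False):
        print(shlex.join(build_eval_command(variant, "path/to/model", use_reasoning)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from inside `src/eval` to launch the run.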
