VisualThinker: First ever R1-Zero's Aha Moment on just a 2B non-SFT Model

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reasoning, Reinforcement Learning with Verifiable Reward, Multimodal Large Language Model
TL;DR: We present the first successful replication of DeepSeek-R1-style emergent reasoning characteristics for multimodal reasoning, using only a non-SFT 2B model.
Abstract: Recently, DeepSeek-R1 demonstrated how reinforcement learning with a simple rule-based reward can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics. In this paper, we present the first successful replication of these emergent characteristics for multimodal reasoning using only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding the SFT setting by about 2%. By further incorporating a small amount of cold-start data, we achieve 70.58% accuracy on CVBench, surpassing GPT-4o-mini. In addition, we observe that applying RL to instruct models often leads to trivial and low-diversity reasoning trajectories, and we present our insights and attempts to understand and mitigate this issue.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23070