VisualThinker: First ever R1-Zero's Aha Moment on just a 2B non-SFT Model

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ES-Reasoning @ ICLR 2026 · CC BY 4.0
Keywords: Reasoning, Reinforcement Learning with Verifiable Reward, Multimodal Large Language Model
TL;DR: We present the first successful replication of the emergent "aha moment" characteristics for multimodal reasoning, using only a non-SFT 2B model.
Abstract: The recent DeepSeek-R1 demonstrated how reinforcement learning with a simple rule-based reward can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics. In this study, we present the first successful replication of these emergent characteristics for multimodal reasoning with only a non-SFT 2B model. Starting from Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding the SFT setting by about 2%. By further incorporating a small amount of cold-start data, we achieve 70.58% accuracy on CVBench, surpassing GPT-4o-mini. In addition, we observe that applying RL to instruct models often leads to trivial and low-diversity reasoning trajectories, and we present our insights and attempts to understand and mitigate this issue.
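To make the "simple rule-based reward" concrete, here is a minimal sketch of what such a verifiable reward function might look like in the R1-Zero style: a small bonus for following the expected response format plus a larger bonus for a correct final answer. The tag names (`<think>`, `<answer>`) and the reward weights are illustrative assumptions, not the paper's exact implementation.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based verifiable reward in the R1-Zero style.

    +0.1 if the response follows the <think>...</think><answer>...</answer>
    format, and +1.0 if the extracted answer matches the ground truth.
    Tag names and weights are assumptions for illustration only.
    """
    reward = 0.0
    # Format reward: the whole response matches the expected tag structure.
    if re.fullmatch(r"\s*<think>.*</think>\s*<answer>.*</answer>\s*",
                    response, flags=re.DOTALL):
        reward += 0.1
    # Accuracy reward: the content of <answer> matches the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward
```

Because the reward depends only on string matching against a known answer, it needs no learned reward model, which is what makes it "verifiable" in this setting.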
Submission Number: 61