VisualThinker: First ever R1-Zero's Aha Moment on just a 2B non-SFT Model

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reasoning, Reinforcement Learning with Verifiable Reward, Multimodal Large Language Model
TL;DR: We present the first successful replication of DeepSeek-R1-style emergent reasoning characteristics for multimodal reasoning, using only a non-SFT 2B model.
Abstract: Recently, DeepSeek-R1 demonstrated how reinforcement learning with a simple rule-based reward can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics. In this paper, we present the first successful replication of these emergent characteristics for multimodal reasoning using only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding the SFT setting by about 2%. By further incorporating a small amount of cold-start data, we achieve 70.58% accuracy on CVBench, surpassing GPT-4o-mini. In addition, we observe that applying RL to instruct models often leads to trivial and low-diversity reasoning trajectories, and we present our insights and attempts to understand and mitigate this issue.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23070