Keywords: Reasoning, Reinforcement Learning with Verifiable Reward, Multimodal Large Language Model
TL;DR: We present the first successful replication of DeepSeek-R1-style emergent reasoning characteristics (the "aha moment" and increased response length) for multimodal reasoning, using only a non-SFT 2B model.
Abstract: The recent DeepSeek-R1 demonstrated how reinforcement learning with a simple
rule-based reward can enable the autonomous development of complex reasoning in
large language models, characterized by the "aha moment", in which the model
manifests self-reflection and increased response length during training. However,
attempts to extend this success to multimodal reasoning have often failed to reproduce
these key characteristics. In this study, we present the first successful replication
of these emergent characteristics for multimodal reasoning using only a non-SFT
2B model. Starting from Qwen2-VL-2B and applying reinforcement learning
directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench,
outperforming the base model by approximately 30% and exceeding the SFT setting
by about 2%. By further incorporating a small amount of cold-start data, we achieve
70.58% accuracy on CVBench, surpassing GPT-4o-mini. In addition, we observe
that applying RL to instruct models often leads to trivial and low-diversity
reasoning trajectories, and we present our insights and attempts to understand
and mitigate this issue.
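To make the notion of a "simple rule-based reward" concrete, here is a minimal illustrative sketch of a verifiable reward function in the style of R1-like training. The `<answer>...</answer>` tag convention and the exact-match rule are assumptions for illustration; the abstract does not specify the paper's actual reward rules.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based verifiable reward: 1.0 if the model's
    final answer (extracted from <answer> tags) exactly matches the
    gold label after normalization, else 0.0."""
    # Extract the answer wrapped in <answer>...</answer> tags, a common
    # formatting convention in R1-style RL setups (assumed here).
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # malformed or untagged output receives no reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0
```

Because the reward checks only a verifiable outcome (answer correctness) rather than grading the reasoning itself, the model is free to discover its own reasoning trajectories during RL.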
Submission Number: 61