Abstract: This work explores two distinct approaches for enhancing reasoning abilities in Large Vision Language Models (LVLMs): supervised fine-tuning (SFT) and reinforcement learning (RL). To support the SFT approach, we curate a multimodal reasoning dataset with complete reasoning traces guided by DeepSeek-R1. For the RL approach, we focus on GRPO and develop a training framework tailored to vision-language tasks with a composite reward system comprising four signals that address both visual perception and reasoning challenges. Our extensive experiments reveal that RL is a significantly more effective strategy than SFT for training reasoning VLMs. While SFT can assist models that initially struggle to follow reasoning instructions, it often induces ``pseudo aha moments'' that degrade overall reasoning performance, suggesting that only a minimal amount of SFT data is necessary. In contrast, RL leads to substantial improvements, outperforming recent baseline models on a range of math reasoning tasks by at least 2% on average. We also present several intriguing findings, e.g., combining SFT with GRPO hurts model performance, and stronger instruction-aligned LVLMs consistently lead to better results under RL. We hope these findings provide valuable insights into the development of reasoning-capable VLMs and guide future research in this area.
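To make the abstract's GRPO setup concrete, the sketch below shows one common way a composite reward could be combined with group-relative advantages. This is a minimal illustration, not the paper's implementation: the signal names, weights, and helper functions are hypothetical placeholders, and the paper's four reward signals are not specified here.

```python
# Minimal sketch (hypothetical, not the paper's code): combine several reward
# signals into one scalar per rollout, then compute GRPO-style group-relative
# advantages by normalizing rewards within a group of rollouts for one prompt.
from typing import Callable, Dict, List


def composite_reward(sample: Dict,
                     signals: Dict[str, Callable[[Dict], float]],
                     weights: Dict[str, float]) -> float:
    """Weighted sum of individual reward signals for a single rollout."""
    return sum(weights[name] * fn(sample) for name, fn in signals.items())


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within a sampled group (mean 0, unit std)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Example usage with placeholder signals (names are assumptions, not the
# paper's actual reward components):
signals = {
    "format": lambda s: 1.0 if s.get("well_formatted") else 0.0,
    "answer": lambda s: 1.0 if s.get("answer_correct") else 0.0,
}
weights = {"format": 0.2, "answer": 1.0}
group = [{"well_formatted": True, "answer_correct": False},
         {"well_formatted": True, "answer_correct": True}]
rewards = [composite_reward(s, signals, weights) for s in group]
advs = grpo_advantages(rewards)  # e.g., [-1.0, 1.0] up to eps
```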
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Devendra_Singh_Dhami1
Submission Number: 5264