Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Published: 12 Jun 2025 · Last Modified: 21 Jun 2025 · EXAIT@ICML 2025 Poster · CC BY 4.0
Track: Language Modeling
Keywords: Vision-Language Models, Reinforcement Learning
Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a prominent recent method that encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide *when reasoning is necessary*. To realize this, we propose **TON**, a two-stage training strategy: **(i)** **Supervised Fine-Tuning (SFT) Stage**: this stage includes a simple yet effective “**thought dropout**” operation, in which reasoning traces are randomly replaced with empty thoughts, introducing a think-or-not format that serves as a cold start for selective reasoning. **(ii)** **GRPO Stage**: this stage enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that **TON** can *reduce completion length by up to **90%** compared to vanilla GRPO while matching, and in some cases improving, performance*. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently show that the *model progressively learns to bypass unnecessary reasoning steps as training advances*. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches.
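The “thought dropout” operation described in the SFT stage can be pictured with a short sketch. The snippet below is a minimal illustration, not the authors’ released code: the `<think>` tag format, the field names (`thought`, `answer`), and the dropout probability `p` are assumptions made for the sake of the example.

```python
import random

# Assumed tag format for an empty reasoning trace; the actual template may differ.
EMPTY_THOUGHT = "<think>\n\n</think>"

def thought_dropout(example: dict, p: float = 0.5) -> dict:
    """Randomly replace the reasoning trace in an SFT target with an empty thought.

    `example` is assumed to hold 'question', 'thought', and 'answer' strings;
    `p` is the probability of dropping the thought. Both are illustrative.
    """
    if random.random() < p:
        # Dropped: the target contains an empty thought followed by the answer.
        completion = f"{EMPTY_THOUGHT}\n{example['answer']}"
    else:
        # Kept: the full reasoning trace precedes the answer.
        completion = f"<think>\n{example['thought']}\n</think>\n{example['answer']}"
    return {**example, "completion": completion}
```

Fine-tuning on a mixture of dropped and kept traces built this way gives the model the think-or-not output format as a cold start, which the subsequent GRPO stage can then exploit to learn when reasoning is actually worthwhile.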
Serve As Reviewer: ~Jiaqi_WANG11, ~Kevin_Qinghong_Lin1
Submission Number: 9