Keywords: Visual Reinforcement Fine-Tuning, explicit thinking, overthinking
TL;DR: We study the thinking process in visual reinforcement fine-tuning
Abstract: This paper investigates the role of the explicit thinking process in rule-based reinforcement fine-tuning (RFT) for multi-modal large language models (MLLMs). We first extend \textit{Thinking-RFT} to the image classification task, using verifiable rewards for fine-tuning~(FT). Experiments show that Thinking-RFT significantly outperforms supervised FT and yields a cross-dataset generalization effect. We then question whether explicit thinking in RFT is always necessary and beneficial. Challenging the convention that explicit thinking is crucial for the success of RFT, we introduce \textit{No-Thinking-RFT}, which explores RFT without thinking via a simple equality accuracy reward. We evaluate No-Thinking-RFT on six diverse tasks across different model sizes and types. The experimental results reveal four key findings: \textbf{(1)} Visual perception tasks do not require thinking during RFT, as No-Thinking-RFT consistently outperforms or matches Thinking-RFT across model sizes and types. \textbf{(2)} Models with limited capabilities struggle to generate high-quality CoT for RFT, making Thinking-RFT less effective than No-Thinking-RFT. \textbf{(3)} For some Thinking-RFT responses, the answer inside the thinking tags is inconsistent with the answer inside the answer tags, and these responses show lower average accuracy than the overall accuracy. \textbf{(4)} The performance gain of No-Thinking-RFT mainly stems from improved learning during no-thinking FT and from avoiding overthinking at inference, as evidenced by the partial gains obtained by appending empty thinking tags to Thinking-RFT at inference time. We hypothesize that explicit thinking before verifiable answers may hinder reward convergence and reduce performance in certain scenarios. To test this, we propose \textit{Think-After-Answer}, which places the thinking after the answer to mitigate this effect. Lastly, we conduct a pilot study on whether MLLMs can learn when to think during RFT, introducing an \textit{Adaptive-Thinking} method. Experiments show that the model converges to either thinking or not thinking depending on its capability, achieving performance comparable to or better than both Thinking-RFT and No-Thinking-RFT. Our findings suggest that MLLMs can adaptively decide whether to think based on their capabilities and task complexity, offering insights into the thinking process in RFT.
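For concreteness, the following is a minimal sketch of what an equality accuracy reward for No-Thinking-RFT could look like; the tag names and the `equality_accuracy_reward` interface are assumptions for illustration, not the paper's released implementation.

```python
import re

def equality_accuracy_reward(response: str, ground_truth: str) -> float:
    """Hypothetical sketch of an equality accuracy reward for No-Thinking-RFT:
    reward 1.0 only when the content of the <answer> tag exactly matches the
    ground-truth label (no <think> block is required in the response)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # malformed output: no answer tag found
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0


# Illustrative usage:
print(equality_accuracy_reward("<answer>golden retriever</answer>", "Golden Retriever"))  # 1.0
print(equality_accuracy_reward("<answer>cat</answer>", "dog"))                            # 0.0
```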
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 3789