Auto-Thinking Evocation in Video Reasoning via Multi-Stage Granular Reinforcement Learning: Stable, Controllable

08 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: reinforcement learning, post training, video reasoning, auto-thinking
TL;DR: A new RL design that stably achieves auto-thinking in video reasoning
Abstract: R1-style reinforcement learning (RL) for stimulating stepwise reasoning significantly boosts Video-MLLMs' performance on complex tasks, yet drastically impairs response efficiency for simple ones. To further incentivize the auto-thinking capability, existing methods typically incorporate reasoning-mode selection into RL reward designs to implicitly regulate thinking preferences across different tasks. However, these methods demand strict tuning of sensitive hyperparameters and careful data management, frequently leading to single-mode dominance when processing video data. To achieve stable and controllable auto-thinking evocation in video reasoning, we design a multi-stage granular RL paradigm. Specifically, the responding process under auto-thinking can be decomposed into two subtasks: 1) determining the reasoning mode, and 2) generating correct answers. Owing to the autoregressive property of LLMs, the initial token governs the overall response mode, while subsequent tokens critically influence answer correctness. From this insight, we improve the model's ability on these two subtasks separately by conducting decoupled RL training on tokens at different positions across two RL phases: Meta-Cognition Training and Cognition-Aware Refinement. In Meta-Cognition Training, we construct a reasoning-strategy dataset to explicitly incentivize suitable starting tokens for different questions, which stably prevents single-mode collapse and achieves controllable thinking preferences. In Cognition-Aware Refinement, learning is fully conditioned on the reasoning or non-reasoning mode, specifically improving answer accuracy under each mode. Through multi-stage granular RL training, we significantly enhance reasoning accuracy while steadily endowing the model with auto-thinking ability. Extensive experiments across multiple video reasoning and perception benchmarks demonstrate that our approach achieves distinct thinking rates while significantly reducing response overhead, ultimately improving overall performance and establishing new state-of-the-art results with superior performance-efficiency trade-offs.
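To make the position-granular credit assignment described in the abstract concrete, a minimal sketch is given below. It is an illustrative assumption, not the authors' released implementation: the mode tags ("<think>", "<answer>"), reward magnitudes, and function names are hypothetical, and it only shows how the first token could receive a mode-selection reward (Meta-Cognition Training) while later tokens receive a correctness reward conditioned on the chosen mode (Cognition-Aware Refinement).

```python
# Hypothetical sketch of decoupled, position-granular reward assignment.
# Token tags, reward values, and function names are assumptions for illustration.

from typing import List


def metacognition_rewards(tokens: List[str], target_mode: str) -> List[float]:
    """Stage 1 (Meta-Cognition Training): reward only the leading token,
    which commits the response to a reasoning mode ("think" vs. "answer")."""
    wants_think = target_mode == "think"
    starts_think = tokens[0] == "<think>"
    mode_reward = 1.0 if starts_think == wants_think else -1.0
    # Credit is confined to the first position; later tokens get zero here.
    return [mode_reward] + [0.0] * (len(tokens) - 1)


def refinement_rewards(tokens: List[str], answer_correct: bool) -> List[float]:
    """Stage 2 (Cognition-Aware Refinement): condition on the already-chosen
    mode token and reward the remaining tokens for answer correctness."""
    answer_reward = 1.0 if answer_correct else 0.0
    # The mode-selecting first token is excluded from this stage's credit.
    return [0.0] + [answer_reward] * (len(tokens) - 1)


if __name__ == "__main__":
    response = ["<think>", "step1", "step2", "<answer>", "B"]
    print(metacognition_rewards(response, target_mode="think"))
    print(refinement_rewards(response, answer_correct=True))
```

In this reading, the two stages never compete for credit on the same positions, which is one plausible way the decoupling could prevent the single-mode collapse the abstract attributes to entangled reward designs.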
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3101