ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Published: 08 Mar 2026, Last Modified: 25 Apr 2026ICLR 2026 Workshop LLM ReasoningEveryoneRevisionsBibTeXCC BY 4.0
Track: tiny / short paper (up to 4 pages)
Keywords: Multi-modal large language models, Reasoning, Reinforcement Learning
Abstract: Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose \textbf{ProcessThinker}, a practical post-training pipeline that provides step-level \emph{process rewards} without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model \textsc{Qwen3-VL-8B-Instruct}.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 117
Loading