Keywords: video understanding, video question answering
Abstract: Applying Reinforcement Learning (RL) to Multimodal Large Language Models (MLLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable gains in the quality of long chains of thought (CoTs) and in downstream performance. To address these limitations, we propose **VIPO-R1**, a **V**erifier-guided **I**terative **P**olicy **O**ptimization method designed to gradually enhance MLLMs' ability to generate long reasoning chains for challenging VideoQA. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. The verifier uses small LLMs as judges to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in length and contextual consistency. The training loop thus combines GRPO's expansive search with DPO's targeted optimization. Experimental results demonstrate: 1) faster and more effective optimization than standard GRPO variants, yielding superior performance; 2) our trained models surpass the direct-inference performance of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) after a single iteration, our model outperforms powerful MLLMs (e.g., Kimi-VL) and thinking models (e.g., Video-R1), highlighting its effectiveness and stability.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25012
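
The abstract above outlines the GRPO-Verifier-DPO training loop: GRPO generates and scores rollouts, a small-LLM verifier filters them into contrastive preference pairs, and DPO optimizes the policy on those pairs. The sketch below is only a schematic illustration of that flow under stated assumptions; every helper (`generate_rollouts`, `grpo_update`, `verifier_judge`, `dpo_update`) and the `Rollout` structure are hypothetical placeholders, not the authors' implementation or API.

```python
"""Minimal, hypothetical sketch of one GRPO -> Verifier -> DPO iteration.
All functions below are placeholder stand-ins, not the paper's code."""

import random
from dataclasses import dataclass


@dataclass
class Rollout:
    question: str
    cot: str        # chain-of-thought produced by the policy
    answer: str
    reward: float   # outcome-based reward used by GRPO


def generate_rollouts(policy, question, k=8):
    # Placeholder: sample k CoT rollouts per question from the current policy.
    return [Rollout(question, f"cot-{i}", f"ans-{i}", random.random()) for i in range(k)]


def grpo_update(policy, rollouts):
    # Placeholder for an outcome-based GRPO step (group-relative advantages).
    return policy


def verifier_judge(rollout):
    # Rollout-aware verifier: a small LLM judges the reasoning logic of a
    # rollout (reflection, contextual consistency). Here, a stand-in check.
    return rollout.reward > 0.5


def build_preference_pairs(rollouts):
    # Verified CoTs become "chosen", rejected ones "rejected", forming the
    # contrastive preference data that drives the DPO stage.
    chosen = [r for r in rollouts if verifier_judge(r)]
    rejected = [r for r in rollouts if not verifier_judge(r)]
    return [(c, r) for c in chosen for r in rejected]


def dpo_update(policy, preference_pairs):
    # Placeholder for a DPO step on the curated preference pairs.
    return policy


def vipo_r1_iteration(policy, questions):
    """One iteration of the loop: expansive search (GRPO), verification,
    then targeted optimization (DPO)."""
    all_pairs = []
    for q in questions:
        rollouts = generate_rollouts(policy, q)
        policy = grpo_update(policy, rollouts)
        all_pairs += build_preference_pairs(rollouts)
    return dpo_update(policy, all_pairs)


if __name__ == "__main__":
    policy = object()  # stand-in for an MLLM policy
    policy = vipo_r1_iteration(policy, ["Q1", "Q2"])
```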