VideoSearch Reasoner: Boosting Multimodal Reward Models through Think with Image Reasoning

ICLR 2026 Conference Submission 15605 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Reward Model, Multimodal LLM
TL;DR: We propose a multimodal reward model built on a thinking-with-image framework.
Abstract: Recent advances in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: **(1)** visual inputs consume large context budgets, forcing fewer frames to be sampled and causing loss of fine-grained detail; and **(2)** all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce **VideoSearch Reasoner**, a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: **(i)** cold start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; **(ii)** rejection-sampling fine-tuning on high-quality traces, selected as samples whose per-dimension and overall judgments are all correct, to further enhance reasoning; and **(iii)** Group Relative Policy Optimization (GRPO) to strengthen reasoning through reinforcement learning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VideoSearch Reasoner achieves 80.5\% on VideoGen Reward, 82.3\% on GenAI-Bench, and 75.6\% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
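As a rough illustration of stage (iii), the sketch below computes GRPO-style group-relative advantages for a group of reasoning traces sampled for the same prompt; the function name and reward values are hypothetical and not taken from the submission, which does not specify its reward design here.

```python
# Minimal sketch (hypothetical, not the paper's implementation):
# GRPO normalizes each sampled trace's reward against the mean and
# standard deviation of its own group, so no separate value model is needed.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Return group-relative advantages for one group of sampled traces."""
    mu = mean(rewards)          # group baseline
    sigma = pstdev(rewards)     # group scale
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 traces for one preference comparison; assume a reward of 1.0
# when the RM's judgment is correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# -> roughly [1.0, -1.0, 1.0, -1.0]
```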
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15605