Keywords: RL, video understanding
Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit **process inconsistency**, where the intermediate reasoning drifts from the video's dynamics even when the final answer is correct, undermining interpretability and robustness.
To address this issue, we introduce **MOSS-ChatV**, a reinforcement learning framework with a **Dynamic Time Warping (DTW)-based process reward**. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct **MOSS-Video**, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation.
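To make the DTW-based process reward concrete, the sketch below shows one plausible rule-based instantiation, assuming the reasoning trace and the reference are both encoded as sequences of discrete per-step state labels; the function names, the 0/1 mismatch cost, and the normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dtw_cost(trace, reference, dist):
    """Classic dynamic-programming DTW: cumulative cost of the best
    monotonic alignment between two sequences."""
    n, m = len(trace), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(trace[i - 1], reference[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def process_reward(trace_states, ref_states):
    # Hypothetical rule-based reward: 0/1 mismatch cost between predicted
    # and reference dynamic states, normalized by the longer sequence and
    # mapped to [0, 1] (higher = better temporal alignment). No learned
    # reward model is involved.
    cost = dtw_cost(trace_states, ref_states,
                    dist=lambda a, b: 0.0 if a == b else 1.0)
    return 1.0 - cost / max(len(trace_states), len(ref_states))

# Example: a trace that lingers on an intermediate state still aligns well.
print(process_reward(["open", "open", "pour", "close"],
                     ["open", "pour", "close"]))
```

Normalizing by the longer sequence keeps the reward scale comparable across traces of different lengths, and the purely rule-based cost matches the abstract's claim that process supervision is possible without an auxiliary reward model.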
MOSS-ChatV achieves 87.2% on the MOSS-Video test split and improves performance on general video benchmarks such as MVBench. The framework consistently yields gains across different architectures, including Qwen2.5-VL and TinyLLaVA-Video, confirming its broad applicability. Evaluations with GPT-4o as judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12963