Keywords: RL, video understanding
Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit **process inconsistency**, where the intermediate reasoning drifts from the video's dynamics even when the final answer is correct, undermining interpretability and robustness.
To address this issue, we introduce **MOSS-ChatV**, a reinforcement learning framework with a **Dynamic Time Warping (DTW)-based process reward**. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct **MOSS-Video**, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation.
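To make the DTW-based process reward concrete, the sketch below shows one plausible rule-based instantiation, assuming the reasoning trace and the reference are both encoded as sequences of discrete per-step state labels; the function names, the 0/1 mismatch cost, and the normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dtw_cost(trace, reference, dist):
    """Classic dynamic-programming DTW: cumulative cost of the best
    monotonic alignment between two sequences."""
    n, m = len(trace), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(trace[i - 1], reference[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def process_reward(trace_states, ref_states):
    # Hypothetical rule-based reward: 0/1 mismatch cost between predicted
    # and reference dynamic states, normalized by the longer sequence and
    # mapped to [0, 1] (higher = better temporal alignment). No learned
    # reward model is involved.
    cost = dtw_cost(trace_states, ref_states,
                    dist=lambda a, b: 0.0 if a == b else 1.0)
    return 1.0 - cost / max(len(trace_states), len(ref_states))

# Example: a trace that lingers on an intermediate state still aligns well.
print(process_reward(["open", "open", "pour", "close"],
                     ["open", "pour", "close"]))
```

Normalizing by the longer sequence keeps the reward scale comparable across traces of different lengths, and the purely rule-based cost matches the abstract's claim that process supervision is possible without an auxiliary reward model.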
MOSS-ChatV achieves 87.2% on the MOSS-Video test split and improves performance on general video benchmarks such as MVBench. The framework consistently yields gains across different architectures, including Qwen2.5-VL and TinyLLaVA-Video, confirming its broad applicability. Evaluations with GPT-4o as judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12963