Keywords: video reasoning
Abstract: Vision-language models (VLMs) have made remarkable progress on video reasoning tasks. However, they still frequently produce flawed reasoning chains: they hallucinate nonexistent objects, misread perceptual details, or confuse spatial and temporal relations. To address these challenges, we introduce VideoCritic-Bench, a benchmark that targets fine-grained reasoning errors in video-language understanding, and VideoCritic-3B, a lightweight 3B-parameter critic model that detects and categorizes such errors. VideoCritic-Bench contains two complementary splits: (1) Synthetic, constructed by injecting controlled reasoning errors into ground-truth chains; and (2) Realistic, a human-verified collection of authentic reasoning errors mined from both small and large VLMs. Together, these splits support systematic training and realistic evaluation of video-reasoning robustness. VideoCritic-3B is designed for reliable and stable reasoning-error detection and is trained with supervised fine-tuning followed by direct preference optimization (DPO), yielding strong performance across multiple error types. In our experiments, VideoCritic-3B delivers competitive results overall and surpasses larger baselines on hallucination and perceptual errors while remaining computationally efficient.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 4