Keywords: synthetic, controllable, video-LLM, benchmark
TL;DR: We introduce VideoCogQA, a controllable synthetic benchmark to evaluate the cognitive abilities of LVLMs, revealing significant limitations in abstract and symbolic reasoning even for SOTA models.
Abstract: Recent advances in Large Video-Language Models (LVLMs) have led to promising results in multimodal video understanding. However, it remains uncertain whether these models possess the key cognitive capabilities required for high-level tasks, especially those involving symbolic and abstract reasoning. Existing benchmarks predominantly rely on real-world, annotated videos, which offer little control over content and inherent difficulty, limiting their diagnostic utility. To address these limitations, we introduce \textbf{VideoCogQA}, a scalable and fully controllable benchmark inspired by game-based environments, designed to assess the cognitive abilities of LVLMs. By generating synthetic videos through a programmatic engine, VideoCogQA offers precise control over visual elements, temporal dynamics, and task difficulty, effectively isolating cognitive reasoning from prior semantic knowledge. The dataset consists of tasks involving abstract concepts, symbolic elements, and multimodal integration, with difficulty levels varied through Python-based game scenarios. Experimental results show that even state-of-the-art (SOTA) models, such as Qwen2.5-VL-72B, achieve an average performance of only 48.8% on tasks involving abstract concepts. Moreover, performance drops by 15% as task complexity increases, highlighting the difficulty LVLMs have in maintaining consistent performance. Through this work, we hope to expose the limitations of current LVLMs and offer insights into how they can more effectively emulate human cognitive processes in the future.
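Note: the abstract describes a programmatic engine that controls visual elements, temporal dynamics, and task difficulty. As a purely hypothetical sketch of what such difficulty-parameterized scenario generation might look like (the names `ScenarioConfig` and `make_config` and all parameter choices below are assumptions for illustration, not the authors' released code):

```python
# Hypothetical illustration only: maps a difficulty level to controllable
# scene parameters, in the spirit of the programmatic engine the abstract
# describes. None of these names or values come from the paper itself.
from dataclasses import dataclass
import random


@dataclass
class ScenarioConfig:
    num_objects: int   # visual elements placed in each frame
    num_frames: int    # temporal length of the generated clip
    num_symbols: int   # size of the abstract-symbol vocabulary used in the task


def make_config(difficulty: int, seed: int = 0) -> ScenarioConfig:
    """Map a difficulty level (1 = easiest) to scene-generation parameters."""
    rng = random.Random(seed)
    return ScenarioConfig(
        num_objects=2 + difficulty + rng.randint(0, 1),
        num_frames=16 * difficulty,
        num_symbols=3 + 2 * difficulty,
    )


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(f"difficulty {level}: {make_config(level)}")
```

Under this reading, "controllable" simply means every axis of difficulty is an explicit, scriptable parameter rather than a property of found footage.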
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 11310