Abstract: Despite advances in vision-language models (VLMs), their ability to perform event-based reasoning across multiple dimensions—temporal, causal, spatial, contextual, and commonsense—remains underexplored. To address this gap, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset. SPLICE consists of 3,381 human-filtered videos of varied lengths spanning 12 categories—e.g., sports, engineering, housework—segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art VLMs on the task of rearranging these clips into coherent event sequences, thereby assessing their visual reasoning capabilities. Our results reveal a substantial performance gap: current models fail to reconstruct plausible sequences at a level comparable to humans. To further investigate this gap, we provide human-annotated textual descriptions as additional input alongside the videos. While these annotations significantly enhance model performance, they do not affect human accuracy, suggesting that models rely heavily on language priors rather than genuine visual comprehension. Even with this added information, models still fall short of human performance, underscoring the challenges of visual reasoning in VLMs.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Resources and Evaluation, Question Answering
Contribution Types: Data resources
Languages Studied: English
Submission Number: 4581