Abstract: Despite advances in vision-language models (VLMs), their ability to perform event-based reasoning across multiple dimensions—temporal, causal, spatial, contextual, and commonsense—remains underexplored. To address this gap, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset. SPLICE consists of 3,381 human-filtered videos of varied lengths spanning 12 categories—e.g., sports, engineering, housework—segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art VLMs on the task of rearranging these clips into coherent event sequences, thereby assessing their visual reasoning capabilities. Our results reveal a substantial performance gap: current models fail to reconstruct plausible sequences at a level comparable to humans. To further investigate this gap, we provide human-annotated textual descriptions as additional input alongside the videos. While these annotations significantly enhance model performance, they do not affect human accuracy, suggesting that models rely heavily on language priors rather than genuine visual comprehension. Even with this added information, models still fall short of human performance, underscoring the challenges of visual reasoning in VLMs.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Resources and Evaluation, Question Answering
Contribution Types: Data resources
Languages Studied: English
Submission Number: 4581