Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

Jianrui Zhang; Mu Cai; Yong Jae Lee

18 Sept 2024 (modified: 05 Feb 2025); Submitted to ICLR 2025; CC BY 4.0

Keywords: temporal reasoning; counterfactual reasoning; short video comprehension

TL;DR: Modern SoTA LMMs still demonstrate subpar temporal reasoning on our temporal counterfactual benchmark composed of natural videos.

Abstract: There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs struggle severely to distinguish temporal differences between different actions and object transformations. For example, the best model, GPT-4o, obtains only $\sim$50\% on our text and video scores, a large gap compared to the human baseline of $\sim$90\%. All open-source multimodal models and CLIP-based models perform much worse, mostly at the level of random chance. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. We will make our benchmark publicly available.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 1636
