CinePile: A Long Video Question Answering Dataset and Benchmark

Ruchit Rawal; Khalid Saifullah; Ronen Basri; David Jacobs; Gowthami Somepalli; Tom Goldstein

Published: 09 Apr 2024, Last Modified: 22 Apr 2024; SynData4CV; CC BY 4.0

Keywords: Datasets and benchmarking, Video understanding, Multi-modal learning, Visual question answering, Long-form video, Metrics and benchmarks

Abstract:

Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be solved by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our approach to creating a question-answer dataset, which uses advanced LLMs and builds upon human-generated raw data. Our dataset comprises 200,000 multiple-choice questions (MCQs) covering a diverse range of visual and multimodal aspects, including temporal comprehension, understanding of human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the evaluation split of our dataset. The findings reveal that even state-of-the-art vision LLMs significantly lag behind human performance on these tasks, highlighting the challenges inherent to video understanding.

Submission Number: 17
