EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

NeurIPS 2023 Datasets and Benchmarks Track, Submission 102

Published: 26 Sept 2023, Last Modified: 06 Feb 2024. NeurIPS 2023 Datasets and Benchmarks Spotlight.
Keywords: video understanding, long-term understanding, video question answering, vision and language
TL;DR: We propose EgoSchema, a new very long-form video question answering dataset offering over 5000 multiple-choice questions, on which current SOTA models achieve less than 33% accuracy in zero-shot question answering while humans achieve about 76%.
Abstract: We introduce EgoSchema, a very long-form video question-answering dataset and benchmark for evaluating the long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human-curated multiple-choice question-answer pairs spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected among five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that clip length alone does not truly capture the temporal difficulty of the video task being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks and datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than those of the second-closest dataset and 10x to 100x longer than those of any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billion parameters achieve QA accuracy below 33% (random is 20%) on the EgoSchema multiple-choice question-answering task, while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, will serve as a valuable evaluation probe for developing effective long-term video understanding systems. Data and zero-shot model evaluation code will be open-sourced under the Ego4D license at http://egoschema.github.io.
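To make the evaluation protocol concrete, below is a minimal sketch of 5-way multiple-choice zero-shot accuracy scoring as described in the abstract. The JSON field names (`video_id`, `question`, `options`, `answer_index`) and the `model.answer` interface are illustrative assumptions, not the official release format; the actual data and evaluation code ship at http://egoschema.github.io.

```python
# Sketch of EgoSchema-style zero-shot multiple-choice evaluation.
# Data layout and the model interface are assumptions for illustration only.
import json
import random


def evaluate(model, questions_path: str) -> float:
    """Return accuracy over 5-way multiple-choice questions."""
    with open(questions_path) as f:
        questions = json.load(f)  # assumed: a list of question dicts

    correct = 0
    for q in questions:
        # Each item pairs a ~3-minute clip with a question and 5 options;
        # the model must pick exactly one option index, zero-shot.
        pred = model.answer(q["video_id"], q["question"], q["options"])
        correct += int(pred == q["answer_index"])
    return correct / len(questions)


class RandomBaseline:
    """Uniform guessing over 5 options: expected accuracy is 20%."""

    def answer(self, video_id, question, options):
        return random.randrange(len(options))
```

Under this protocol, the abstract's numbers read directly off the accuracy scale: random guessing sits at 20%, current billion-parameter models below 33%, and humans at about 76%.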
Supplementary Material: zip
Submission Number: 102