Keywords: multimodal understanding, long-form video-language understanding
TL;DR: We introduce HourVideo, a benchmark dataset designed to rigorously evaluate how well multimodal models comprehend one-hour-long videos.
Abstract: We present **HourVideo**, a benchmark dataset for one-hour video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (*recall*, *tracking*), visual reasoning (*spatial*, *temporal*, *predictive*, *causal*, *counterfactual*), and navigation (*room-to-room*, *object retrieval*) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features **12,976 high-quality, five-way multiple-choice questions**. Benchmarking results reveal that multimodal models, including GPT-4V and LLaVA-NeXT, achieve only marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at https://hourvideo.stanford.edu.
Supplementary Material: zip
Submission Number: 351