Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding

ICLR 2026 Conference Submission 18623 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: vision-language models, long video understanding, memory consolidation
TL;DR: A benchmark for long video understanding
Abstract: Despite recent progress in vision-language models (VLMs), holistic understanding of long-form video content remains a significant challenge, partly due to limitations in current benchmarks. Many focus on peripheral, "needle-in-a-haystack" details, encouraging context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we introduce $\textbf{MF}^2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information---requiring integration of both visual and linguistic modalities---from full-length movies ($\textbf{50-170 minutes long}$). MF$^2$ includes over 50 full-length, $\textbf{open-licensed}$ movies, each with a manually constructed set of claim pairs---one true claim (fact) and one plausible but false claim (fib) per pair---totalling over 850 pairs. These claims target core narrative elements such as $\textbf{character motivations}$ and $\textbf{emotions}$, $\textbf{causal chains}$, and $\textbf{event order}$, and refer to $\textbf{memorable moments}$ that humans can recall without rewatching the movie. Instead of multiple-choice formats, we adopt a binary claim evaluation protocol: for each pair, models must correctly identify both the true and the false claim. This reduces biases such as answer ordering and enables a more precise assessment of reasoning. Our experiments demonstrate that state-of-the-art models, both open-weight and closed, fall well short of human performance, underscoring the relative ease of the task for humans and their superior ability to retain and reason over critical narrative information---an ability current VLMs lack.
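The pair-level scoring implied by this protocol can be made concrete with a minimal sketch. The names below (`ClaimPair`, `judge`, `pair_accuracy`) are hypothetical illustrations, not the authors' released evaluation code; the only assumption is a model-backed `judge` function that labels a single claim as true or false, with a pair counted as correct only when both the fact and the fib are classified correctly.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClaimPair:
    fact: str  # true claim about the movie
    fib: str   # plausible but false claim about the movie


def pair_accuracy(pairs: List[ClaimPair],
                  judge: Callable[[str], bool]) -> float:
    """Pair-level accuracy under the binary claim protocol.

    A pair counts as correct only if the model labels the fact
    as true AND the fib as false, so random guessing on each
    claim yields 25% pair accuracy, not 50%.
    """
    correct = sum(judge(p.fact) and not judge(p.fib) for p in pairs)
    return correct / len(pairs)
```

Scoring at the pair level, rather than per claim, is what makes the protocol stricter than independent true/false questions: a model cannot score well by defaulting to one label.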
Primary Area: datasets and benchmarks
Submission Number: 18623