Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding

Published: 02 Mar 2026, Last Modified: 11 Mar 2026
ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: vision-language models, long video understanding, memory consolidation, benchmarks, evaluation
TL;DR: A benchmark for long video (movie) understanding
Abstract: Holistic understanding of long-form video remains a challenge for vision-language models (VLMs). Unfortunately, current benchmarks cannot easily capture this limitation, since they mostly focus on "needle-in-a-haystack" details, rewarding context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we address this gap by introducing MF$^2$, a new benchmark that evaluates how well models comprehend, consolidate, and recall key narrative information from full-length movies (**50-170 minutes long**), requiring integration of **both** visual and language modalities. MF$^2$ includes over 50 full-length, **open-licensed** movies, each paired with manually constructed sets of claim pairs---one true (*fact*) and one plausible but false (*fib*)---totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to **memorable moments** that humans can recall without rewatching the movie. Our experiments demonstrate that both open-weight and closed-source state-of-the-art models fall well short of human performance. While humans can effectively retain and reason over critical narrative information, making the task relatively easy for them, current VLMs lack this ability and thus struggle.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 34