MultiHaystack: Benchmarking Multimodal Reasoning over 40K Images, Videos, and Documents

ICLR 2026 Conference Submission13281 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, License: CC BY 4.0
Keywords: Multimodal Haystack, Retrieval, VQA
TL;DR: A benchmark for evaluating MLLMs on large-scale (over 40K candidates) cross-modal retrieval and reasoning
Abstract: Multimodal large language models (MLLMs) have advanced rapidly on benchmarks involving isolated text, image, or video tasks, but such settings overlook a crucial step in real-world applications: retrieving evidence from large, heterogeneous corpora before reasoning. Existing benchmarks typically provide only hundreds or thousands of candidates, making retrieval trivial and overstating model reliability. To address this gap, we introduce MultiHaystack, the first benchmark for large-scale, realistic cross-modal retrieval and reasoning. It contains over 46,000 documents, images, and videos paired with 747 uniquely verifiable questions, ensuring unambiguous evaluation while requiring both modality selection and fine-grained reasoning. Our experiments reveal a consistent pattern: models answer competitively when given the answer-containing file directly, but performance drops sharply once evidence must be retrieved at scale. The best retriever (E5-V) achieves only 40.8% Recall@1, while even GPT-5 reaches just 51.4% VQA accuracy under top-5 retrieval. These results show that retrieval, rather than reasoning, is the dominant bottleneck, establishing MultiHaystack as a rigorous benchmark that exposes weaknesses hidden by small-scale evaluations and highlights retrieval as the key frontier for advancing MLLMs.
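For concreteness, the retrieval metric cited in the abstract (Recall@K over a heterogeneous candidate pool) can be computed as in the minimal sketch below; the function name, data layout, and variable names are illustrative assumptions, not the benchmark's released evaluation code.

```python
from typing import Dict, List

def recall_at_k(
    ranked_candidates: Dict[str, List[str]],  # question id -> candidate file ids, best first
    gold_evidence: Dict[str, str],            # question id -> id of the answer-containing file
    k: int,
) -> float:
    """Fraction of questions whose gold evidence file appears in the top-k retrieved candidates."""
    hits = 0
    for qid, gold in gold_evidence.items():
        if gold in ranked_candidates.get(qid, [])[:k]:
            hits += 1
    return hits / max(len(gold_evidence), 1)

# Hypothetical usage: a retriever ranks all ~46K files per question,
# and the top-k candidates are then passed to an MLLM for VQA.
# recall1 = recall_at_k(rankings, gold, k=1)   # corresponds to the Recall@1 setting
# recall5 = recall_at_k(rankings, gold, k=5)   # top-5 pool used for VQA accuracy
```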
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13281