Keywords: multimodal, vision language, reasoning, hallucination, benchmark, cognitive
TL;DR: We release a cognitively inspired benchmark for reasoning across scenes that reveals hallucination is an open challenge for multimodal models
Abstract: Multimodal language models possess a remarkable ability to handle an open vocabulary of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing, saturating perception benchmarks and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes, Common-O Bench, with more than 10.5k examples that use exclusively new images not found in web training data to avoid contamination. Inspired by cognitive tests for humans, Common-O goes beyond perception to probe reasoning across scenes by asking ``what’s in common?''. We evaluate leading multimodal language models, including models specifically trained to reason. We find that perceiving objects in single images is easy for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Despite saturating many perception-focused leaderboards, the best-performing model achieves only 35\% on Common-O Bench; on Common-O Complex, which consists of more complex scenes, the best model achieves only 1\%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrences seen during training. Among the models we evaluated, scale provides modest improvements, while models explicitly trained with multi-image inputs show larger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/facebook/Common-O
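Since the benchmark is hosted on the Hugging Face Hub at the URL above, it can presumably be loaded with the `datasets` library. The sketch below is a minimal, hedged example: the repository id comes from the Dataset URL, but the split names and example fields are assumptions that may differ from the released schema.

```python
# Minimal sketch: load the Common-O benchmark from the Hugging Face Hub.
# The repository id comes from the Dataset URL above; split and field names
# are assumptions and should be checked against the released dataset card.
from datasets import load_dataset

common_o = load_dataset("facebook/Common-O")  # downloads all available splits
print(common_o)  # inspect the actual split names and features

# Peek at one example from the first available split.
first_split = next(iter(common_o.values()))
print(first_split[0])
```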
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1056