Towards a Robust Benchmark of Object Hallucination on Multiple Images

08 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Model, Object hallucination, Benchmark
Abstract: Multimodal Large Language Models (MLLMs) are evolving into sophisticated agentic systems that engage users in complex, multi-image scenarios. However, current MLLMs remain limited by object hallucination, generating information inconsistent with the visual evidence. Existing benchmarks, largely designed for single-image settings or offering only high-level multi-image assessments, fail to capture the nuanced causes of object hallucination, particularly under adversarial conditions. To address this, we introduce the Multi-Image Object Hallucination (MIOH) benchmark, a comprehensive framework specifically designed to diagnose MLLM vulnerabilities in complex multi-image contexts. MIOH integrates four object-centric tasks (existence, counting, attribute, position) with four controllable adversarial factors (visual context scale, perceptual difficulty, contextual bias, and misleading textual context). Through systematic evaluation with MIOH, we show that even state-of-the-art models, including GPT-5 and Gemini Pro, still suffer significant performance degradation under adversarial conditions, exhibiting increased susceptibility to both false-positive and false-negative hallucinations as the visual and linguistic contexts become more challenging.
Primary Area: datasets and benchmarks
Submission Number: 2947