Keywords: Multi-Modal Large Language Model, Ambiguity, Benchmark, Dataset
TL;DR: We propose a benchmark to evaluate the performance of current MLLMs in ambiguous contexts; the results show that current MLLMs lag behind human performance by about 36.85% on average.
Abstract: While visual information in multimodal settings can naturally help resolve inherent ambiguities in natural language, the ability of multimodal large language models (MLLMs) to leverage visual cues for disambiguation remains underexplored. In this paper, we introduce a benchmark specifically designed to evaluate the performance of MLLMs in Ambiguous contexts (MMA). MMA uses a multiple-choice visual question-answering format with a novel evaluation protocol in which each ambiguous text is paired with two distinct images that suggest different scenarios. This setup requires models to provide different correct answers based on the visual context, effectively testing their ability to perform cross-modal disambiguation. By evaluating 25 proprietary and open-source MLLMs, we find that: (1) MLLMs often overlook scenario-specific information provided by images to clarify the ambiguity of texts. When presented with two different contextual images and asked the same question, MLLMs achieved an accuracy of only 53.22% in answering both correctly, compared to human performance at 88.97%. (2) Among the three types of ambiguity, models perform best under lexical ambiguity and worst under syntactic ambiguity. (3) Proprietary models (e.g., Gemini 2.0 Pro, the top performer at 78.9%) outperform open-source counterparts by an average margin of 16.78%. These findings underscore the current limitations of MLLMs in integrating visual information to clarify textual ambiguities and highlight critical areas for future improvement. The code and benchmark data are available at https://github.com/physicsru/mma
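The paired evaluation protocol described in the abstract can be sketched as follows; this is a hypothetical illustration (the function name and data layout are assumed, not taken from the released code). A model is credited for an ambiguous item only if it answers correctly under both contextual images:

```python
def paired_accuracy(results):
    """Fraction of ambiguous items answered correctly under BOTH images.

    results: list of (correct_under_image_a, correct_under_image_b) booleans,
    one pair per ambiguous text. Layout is an assumption for illustration.
    """
    if not results:
        return 0.0
    # Credit an item only when both image-conditioned answers are correct.
    both_correct = sum(1 for a, b in results if a and b)
    return both_correct / len(results)

# Example: 3 of 4 ambiguous items answered correctly under both images.
print(paired_accuracy([(True, True), (True, False), (True, True), (True, True)]))  # 0.75
```

Under this metric, a model that always picks the answer favored by the text alone (ignoring the image) can be correct on at most one of the two paired images, which is what makes the protocol a test of cross-modal disambiguation rather than of text-only priors.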
Submission Number: 98