Quantifying and Evaluating Continuity Properties of Multi-modal LLMs

ACL ARR 2026 January Submission668 Authors

24 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: uncertainty estimation, multimodal LLMs, reliability assessment.
Abstract: Recent advances in multimodal large language models (MLLMs), while extending the skills and capabilities of text-only LLMs, have also made model responses vulnerable to increased hallucination, reduced contextual awareness, and inconsistency in complex reasoning. Most existing MLLM benchmarks consist of isolated samples and therefore cannot evaluate the continuity and monotonicity properties of these models. In this paper, we develop a synthetic benchmark for evaluating MLLM performance and uncertainty along continuously varying dimensions of complexity. The benchmark relies on the core real-world principle that $\textit{inputs of increasing/decreasing ambiguity should ideally lead to higher/lower model uncertainty}$. We experiment with $5$ large vision language models (LVLMs) and $4$ large audio language models (LALMs) across various image question-answering and audio question-answering tasks. Our findings show that most MLLMs lack the human-like continuity and monotonicity observed in the real world.
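The monotonicity principle in the abstract can be made concrete: one common way (an illustrative assumption here, not the paper's stated metric) to quantify whether uncertainty rises with ambiguity is a Spearman rank correlation between the ambiguity level of each input and the uncertainty score the model assigns to it. All function names and data below are hypothetical.

```python
# Illustrative sketch (not the paper's method): quantify the principle
# "inputs of increasing ambiguity should lead to higher model uncertainty"
# via Spearman rank correlation. A score near +1 means near-monotone
# behavior; a score near 0 or below means the property is violated.

def rank(values):
    """Assign 1-based average ranks to values, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical example: 5 inputs of strictly increasing ambiguity and the
# uncertainty scores a model assigned to them (one small inversion).
ambiguity = [1, 2, 3, 4, 5]
uncertainty = [0.10, 0.22, 0.18, 0.40, 0.55]
score = spearman(ambiguity, uncertainty)  # high but below 1.0 due to the inversion
```

A model with human-like continuity would score near $+1$ across such sequences; aggregating this correlation over many graded input families gives a single monotonicity measure per model.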
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, automatic creation and evaluation of language resources, evaluation methodologies, evaluation, metrics
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 668