Probing Logical Reasoning of MLLMs in Scientific Diagrams

Yufei Wang, Adriana Kovashka

Published: 04 Nov 2025, Last Modified: 04 May 2026, Venue: EMNLP 2025, License: CC BY 4.0

Abstract: We examine how multimodal large language models (MLLMs) perform logical inference grounded in visual information. We first construct a dataset of food web and food chain images, paired with questions that follow seven structured templates requiring progressively more complex reasoning. We show that complex reasoning about entities in the images remains challenging, even with elaborate prompts, and that visual information is underutilized.