Abstract: Recent advancements in multi-modal large language models (MLLMs) have demonstrated promising capabilities in integrating visual and textual information to solve complex problems. While many of these models exhibit strong performance in mathematics or general vision tasks, it remains unclear whether they possess the scientific reasoning skills necessary to tackle challenges across diverse domains such as physics and chemistry. In this work, we aim to bridge this gap by introducing a new benchmark, VisScience, designed to systematically evaluate MLLMs on multi-disciplinary scientific reasoning. Our benchmark consists of 3,000 carefully curated questions spanning K12 education, with equal representation from mathematics, physics, and chemistry (1,000 questions each). These questions are drawn from 21 subject areas and are categorized into five difficulty levels to reflect a broad range of curricular concepts and reasoning demands. Using VisScience, we evaluate 25 representative MLLMs, including both open-source and closed-source models, on scientific reasoning. Our results show that performance varies notably across disciplines: models generally perform best on mathematics, while physics and chemistry questions expose weaknesses in scientific abstraction and visual grounding. Furthermore, we examine model behavior under multilingual settings, as VisScience is provided in both English and Chinese, enabling a cross-linguistic perspective on scientific reasoning.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Reasoning, Scientific Reasoning
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 7501