Abstract: Recent advancements in multi-modal large language models (MLLMs) have demonstrated promising capabilities in integrating visual and textual information to solve complex problems. While many of these models exhibit strong performance in mathematics or general vision tasks, it remains unclear whether they possess the scientific reasoning skills necessary to tackle challenges across diverse domains such as physics and chemistry. In this work, we aim to bridge this gap by introducing a new benchmark, VisScience, designed to systematically evaluate MLLMs on multi-disciplinary scientific reasoning. Our benchmark consists of 3,000 carefully curated questions spanning K12 education, with equal representation from mathematics, physics, and chemistry (1,000 questions each). These questions are drawn from 21 subject areas and are categorized into five difficulty levels to reflect a broad range of curricular concepts and reasoning demands. Using VisScience, we evaluate 25 representative MLLMs, including both open-source and closed-source models, on scientific reasoning. Our results show that performance varies notably across disciplines: models generally perform best on mathematics, while physics and chemistry questions expose weaknesses in scientific abstraction and visual grounding. Furthermore, we examine model behavior under multilingual settings, as VisScience is provided in both English and Chinese, enabling a cross-linguistic perspective on scientific reasoning.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Reasoning, Scientific Reasoning
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 7501