VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

Zhihuan Jiang; Zhen Yang; Jinhao Chen; Zhengxiao Du; Weihan Wang; Bin Xu; Jie Tang

27 Sept 2024 (modified: 13 Dec 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Multi-modal Large Language Model, Scientific Reasoning, Benchmark

Abstract: Multi-modal large language models (MLLMs) have shown promise in integrating textual and visual information to handle complex visual understanding tasks. However, most benchmarks evaluating MLLMs focus mainly on mathematics or general visual understanding, revealing a significant gap in assessing capabilities across other critical scientific disciplines like physics and chemistry. To bridge this gap, we meticulously construct a comprehensive benchmark, VisScience, to evaluate multi-modal scientific reasoning across mathematics, physics, and chemistry. This benchmark comprises 3,000 questions drawn from K12 education, from elementary to high school levels, evenly distributed with 1,000 questions per discipline. VisScience encompasses 21 distinct subjects, classified into five difficulty levels to cover a wide range of topics within each discipline. We utilize VisScience to conduct a detailed evaluation of 25 representative MLLMs in scientific reasoning. The experimental results show that closed-source MLLMs generally surpass open-source models, with standout performances including a 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro. These results underscore the strengths and limitations of MLLMs, suggesting areas for future improvement and highlighting the importance of developing models that can effectively handle the diverse demands of multi-modal scientific reasoning.

Primary Area: datasets and benchmarks

Submission Number: 9998
