MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

ACL ARR 2025 May Submission 5491 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Scientific figure interpretation is crucial for AI scientific assistants built on Large Vision Language Models, yet existing datasets cover only a narrow range of scientific domains and figures of limited complexity (e.g., simple charts). We address this gap with a comprehensive dataset drawn from peer-reviewed Nature Communications articles spanning 72 scientific fields, featuring complex visualizations that require graduate-level expertise to interpret. An evaluation of 19 proprietary and open-source models on figure-captioning and multiple-choice tasks, alongside human expert annotation, revealed significant performance gaps. Beyond benchmarking, our dataset enables effective large-scale training: a Qwen2-VL-2B model fine-tuned on our data outperformed GPT-4o and human experts on the multiple-choice tasks, while continued pre-training on interleaved article-figure data improved downstream performance in materials science. We will release the dataset to support further research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Scientific Figure Understanding, Large Vision Language Model, Multimodal Large Language Model, Evaluation, Benchmark, Materials Science, Multi-discipline, Multimodal, Scientific knowledge understanding, Nature science
Contribution Types: Data resources
Languages Studied: English
Submission Number: 5491