Submission Track: LLMs for Materials Science - Full Paper
Submission Category: All of the above
Keywords: LLM, Benchmark, MLM, Multimodal
TL;DR: We present a dataset for benchmarking multimodal language models.
Abstract: We present MaCBench, a multimodal benchmark for evaluating AI models in chemistry and materials science tasks.
This benchmark addresses the lack of comprehensive, domain-specific evaluation tools for multimodal AI in scientific contexts. MaCBench comprises 628 questions spanning three key areas: fundamental scientific understanding, data extraction from visual information, and practical laboratory knowledge. It pairs diverse visual inputs, such as laboratory images, band structures, crystal structures, and atomic force microscopy images, with multiple-choice questions. We evaluate state-of-the-art multimodal AI models (GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro) on MaCBench, revealing significant performance variations across tasks and skills.
While the models excel at basic pattern recognition and information retrieval, they struggle with complex reasoning and with applying scientific principles to novel situations. Notably, we observe a disconnect between object recognition and contextual understanding in laboratory safety scenarios. MaCBench provides crucial insights into the capabilities and limitations of multimodal AI in chemistry and materials science, serving as a valuable tool for guiding the development of more capable AI systems for scientific research.
AI4Mat Journal Track: Yes
Submission Number: 50