LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature
Keywords: materials synthesis, large language models, data extraction, scientific literature mining, multi-modal learning, vision language models, synthesis ontology, materials informatics, figure digitization, inorganic materials, automated extraction, synthesis prediction
TL;DR: This paper presents a multi-modal AI toolbox that uses LLMs and VLMs to automatically extract and structure synthesis procedures from 81k materials science papers into a standardized, machine-readable database for accelerating materials discovery.
Abstract: The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that resist systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from both the text and the figures of materials science publications. We curated 81k open-access papers, yielding LeMat-Synth (v1.0): a dataset of synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. Extraction quality is evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure for transforming unstructured literature into machine-readable information, laying the groundwork for predictive modeling of synthesis procedures and of synthesis–structure–property relationships.
Submission Track: Findings, Tools & Open Challenges
Submission Category: AI-Guided Design + Automated Synthesis
Supplementary Material: pdf
Institution Location: {Paris, France}, {Atlanta, USA}, {Lyon, France}, {Santa Barbara, USA}, {Zurich, Switzerland}, {Baltimore, USA}, {Davis, USA}, {Cambridge, USA}, {New Delhi, India}
AI4Mat Journal Track: Yes
AI4Mat RLSF: Yes
Submission Number: 108