Knowledge Graph Extraction from Total Synthesis Documents

Published: 08 Jul 2024, Last Modified: 23 Jul 2024, Venue: AI4Mat-Vienna-2024 Oral, License: CC BY 4.0
Submission Track: Full Paper
Submission Category: AI-Guided Design + Automated Chemical Synthesis
Keywords: Knowledge graph, total synthesis, data extraction, benchmark
TL;DR: A benchmark for data extraction of organic syntheses from papers is proposed, along with LLM-based data extraction algorithms
Abstract: Knowledge graphs (KGs) have emerged as a powerful tool for organizing and integrating complex information, making them a suitable format for scientific knowledge. However, translating scientific knowledge into KGs is challenging because a wide variety of styles and elements are used to present data and ideas. Although efforts for KG extraction (KGE) from scientific documents exist, evaluation remains challenging and field-dependent, and existing benchmarks do not focus on scientific information. Furthermore, establishing a general benchmark for this task is difficult because not all scientific knowledge has a ground-truth KG representation, making any benchmark prone to ambiguity. Here we propose the Graph of Organic Synthesis Benchmark (GOSyBench), a benchmark for KG extraction from scientific documents in chemistry that leverages the native KG-like structure of synthetic routes in organic chemistry. We develop KG-extraction algorithms based on LLMs (GPT-4, Claude, Mistral) and VLMs (GPT-4o), the best of which reaches 73% recovery accuracy and 59% precision, leaving substantial room for improvement. We expect that GOSyBench can serve as a valuable resource for evaluating and advancing KGE methods in the scientific domain, ultimately facilitating better organization, integration, and discovery of scientific knowledge.
Submission Number: 19
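
The abstract reports recovery accuracy and precision for extracted graphs but does not define them here. As a rough illustration only, the sketch below assumes a synthetic route is stored as a set of directed (reactant, product) edges and that an extracted edge counts as correct only on an exact match. The edge encoding, the function name `edge_recall_precision`, and the toy compound labels are all hypothetical and are not taken from the paper; GOSyBench may score matches differently.

```python
from typing import Set, Tuple

# Hypothetical edge encoding: (reactant, product). Not the paper's schema.
Edge = Tuple[str, str]


def edge_recall_precision(truth: Set[Edge], predicted: Set[Edge]) -> Tuple[float, float]:
    """Return (recall, precision) of predicted edges under exact matching.

    Recall is the share of ground-truth edges that were recovered;
    precision is the share of predicted edges that are correct.
    """
    matched = truth & predicted
    recall = len(matched) / len(truth) if truth else 0.0
    precision = len(matched) / len(predicted) if predicted else 0.0
    return recall, precision


# Toy linear route A -> B -> C -> D -> E (made-up labels).
truth = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")}
# Toy extraction: three correct edges, two spurious ones, one edge missed.
predicted = {("A", "B"), ("B", "C"), ("C", "D"), ("C", "X"), ("X", "E")}

recall, precision = edge_recall_precision(truth, predicted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case three of the four true edges are recovered and three of the five predicted edges are correct. Exact string matching is the simplest possible choice; a real evaluation of chemical routes would likely need to normalize compound identifiers before comparing edges.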