Exploring Organic Syntheses through Natural Language

Published: 27 Oct 2023, Last Modified: 08 Dec 2023, AI4Mat-2023 Spotlight
Submission Track: Papers
Submission Category: AI-Guided Design + Automated Chemical Synthesis
Keywords: chemical space, exploration, large language models, organic synthesis, dataset
Supplementary Material: pdf
TL;DR: We propose methods for exploring the chemical space at the level of natural language
Abstract: Chemists employ a number of levels of abstraction for describing objects and communicating ideas. Most of this knowledge is in the form of natural language, through books, articles and oral explanations, due to its flexibility and capacity to connect the different levels of abstraction. Despite this, machine-learning chemical models are typically limited to low-level abstractions like graph representations or dynamic point clouds that, although powerful, ignore important aspects like procedural details. In this work, we propose methods for exploring the chemical space at the rich level of natural language. In this setting, synthetic procedure paragraphs are split into segments, each assigned to one of four possible classes, and subsequently mapped into a latent space where they can be conveniently studied. We explore the structure of this space, and find interesting connections with experimental realisation that are beyond the scope of commonly used reaction SMILES. This work aims to draw a path towards LLM-based data processing and chemical space exploration, by analyzing chemical data in previously inaccessible ways that will ultimately allow for a better understanding of materials design.
Digital Discovery Special Issue: Yes
Submission Number: 74