Automated, LLM enabled extraction of synthesis details for reticular materials from scientific literature

Viviane Torres da Silva; Alexandre Rademaker; Krystelle Lionti; Ronaldo Giro; Geisa Lima; Sandro Rama Fiorini; Marcelo Archanjo; Breno W S R Carvalho; Rodrigo Neumann Barros Ferreira; Anaximandro Souza; João Pedro Gandarela de Souza; Gabriela de Valnisio; Carmen Paz; Renato Cerqueira; Mathias B Steiner

Published: 08 Oct 2024, Last Modified: 03 Nov 2024

Venue: AI4Mat-NeurIPS-2024

License: CC BY 4.0

Submission Track: LLMs for Materials Science - Full Paper

Submission Category: Automated Material Characterization

Keywords: LLM, knowledge extraction, synthesis details, reticular material, scientific literature

TL;DR: Exploring the use of open-source LLMs to extract knowledge from scientific literature

Abstract: Automated knowledge extraction from scientific literature can potentially accelerate materials discovery. We have investigated an approach for extracting synthesis protocols for reticular materials from scientific literature using large language models (LLMs). To that end, we introduce a Knowledge Extraction Pipeline (KEP) that automates LLM-assisted paragraph classification and information extraction. By applying prompt engineering with in-context learning (ICL) to a set of open-source LLMs, we demonstrate that LLMs can retrieve chemical information from PDF documents without the need for fine-tuning or training and with a reduced risk of hallucination. By comparing the performance of five open-source families of LLMs in both paragraph classification and information extraction tasks, we observe excellent model performance even when only a few example paragraphs are included in the ICL prompts. The results show the potential of the KEP approach for reducing human annotation and data curation efforts in automated scientific knowledge extraction.

AI4Mat Journal Track: Yes

Submission Number: 56
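
Illustrative sketch (not part of the submission): the abstract describes paragraph classification driven by ICL prompts that contain only a few example paragraphs. The Python snippet below shows one way such a few-shot prompt might be assembled. The function name, the label names ("synthesis", "other"), the instruction wording, and the example paragraphs are all assumptions made for illustration; they are not taken from the paper, and no LLM is called.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LabeledParagraph:
    """One in-context example: a paragraph and its (hypothetical) label."""
    text: str
    label: str  # assumed label set: "synthesis" or "other"


# Made-up example paragraphs, for illustration only (not from the paper).
ICL_EXAMPLES = [
    LabeledParagraph(
        "The linker and the metal salt were dissolved in DMF and heated "
        "in a sealed vial to give crystals.",
        "synthesis",
    ),
    LabeledParagraph(
        "Reticular materials have attracted attention for gas storage.",
        "other",
    ),
]


def build_classification_prompt(
    examples: Sequence[LabeledParagraph], paragraph: str
) -> str:
    """Assemble a few-shot prompt: instruction, labeled examples, then the query."""
    lines = [
        "Label each paragraph as 'synthesis' if it describes a synthesis "
        "procedure, otherwise as 'other'.",
        "",
    ]
    for example in examples:
        lines.append(f"Paragraph: {example.text}")
        lines.append(f"Label: {example.label}")
        lines.append("")
    # The query paragraph is left unlabeled so the model completes the label.
    lines.append(f"Paragraph: {paragraph}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_classification_prompt(
    ICL_EXAMPLES, "The mixture was stirred and then filtered."
)
```

The resulting string would be sent to whichever open-source LLM is under evaluation; adding or removing entries in `ICL_EXAMPLES` changes the number of in-context examples without touching the rest of the code.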
