Accurate Prediction of Experimental Band Gaps from Large Language Model-Based Data Extraction

Published: 27 Oct 2023, Last Modified: 11 Dec 2023
AI4Mat-2023 Poster
Submission Track: Papers
Submission Category: AI-Guided Design
Keywords: Large language models, data mining, experimental band gap
TL;DR: We use LLMs to extract experimental band gaps, and show that a model trained on this dataset achieves a 15% reduction in the MAE of predicted band gaps over SoTA
Abstract: Machine learning is transforming materials discovery by providing rapid predictions of material properties, which enables large-scale screening for target materials. However, such models require training data. While automated data extraction from the scientific literature has potential, current auto-generated datasets often lack sufficient accuracy as well as the critical structural and processing details that influence material properties. Using band gap as an example, we demonstrate that large language model (LLM) prompt-based extraction yields an order-of-magnitude lower error rate than existing automated approaches. Combined with additional prompts that select the subset of experimentally measured properties of pure, single-crystalline bulk materials, this results in an automatically extracted dataset that is larger and more diverse than the largest existing human-curated database of experimental band gaps. We show that a model trained on our extracted database achieves a 19% reduction in the mean absolute error of predicted band gaps compared to a model trained on the existing human-curated database. Finally, we demonstrate that LLMs are able to train band gap prediction models on the extracted data, yielding an automated pipeline from data extraction to materials property prediction.
Submission Number: 32
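The abstract describes a two-stage prompting scheme: one prompt extracts band gap records from paper text, and additional prompts filter for experimentally measured values from pure, single-crystalline bulk materials. Below is a minimal sketch of what such a pipeline might look like, assuming the OpenAI chat completions API; the prompt wording, model name, output schema, and helper names are illustrative assumptions, not the authors' actual prompts or code.

```python
# Hypothetical sketch of prompt-based band gap extraction and filtering.
# The prompts, model choice, and JSON schema are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = (
    "From the text below, list every reported band gap as JSON objects with "
    'keys "material", "band_gap_eV", "method" (experimental or computed), '
    '"crystallinity", and "form" (bulk, thin film, nanoparticle, ...). '
    "Return a JSON list only.\n\nText:\n{text}"
)

FILTER_PROMPT = (
    "Given this JSON record, answer yes or no: is it an experimentally "
    "measured band gap of a pure, single-crystalline bulk material?\n\n{record}"
)

def extract_band_gaps(paper_text: str) -> list[dict]:
    """First prompt: pull structured band gap records out of raw paper text."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model choice
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(text=paper_text)}],
        temperature=0,
    )
    # A sketch only: assumes the model returns clean JSON with no extra text.
    return json.loads(resp.choices[0].message.content)

def keep_record(record: dict) -> bool:
    """Second prompt: keep only experimental, single-crystal, bulk entries."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": FILTER_PROMPT.format(record=json.dumps(record))}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

if __name__ == "__main__":
    excerpt = ("The optical band gap of single-crystal GaAs was measured "
               "to be 1.42 eV at room temperature.")
    records = [r for r in extract_band_gaps(excerpt) if keep_record(r)]
    print(records)
```

In this sketch, the filtered records would then feed a conventional property-prediction model; the paper additionally reports that the LLM itself can drive that model-training step.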