LLaMat: Large Language Models for Materials Science Information Extraction

NeurIPS 2024 Workshop AI4Mat, Submission 87

Published: 08 Oct 2024; Last Modified: 09 Dec 2024
Venue: AI4Mat-NeurIPS-2024
License: CC BY 4.0
Submission Track: LLMs for Materials Science - Short Paper
Submission Category: AI-Guided Design
Keywords: large language models, materials discovery, information extraction, table understanding, materials science
TL;DR: Large Language Models for Materials Science Information Extraction
Abstract: Large language models have emerged as an important tool for information extraction and as scientific assistants in materials science and discovery. However, their performance is limited by a lack of domain expertise. In this work, we propose the LLaMat models, namely LLaMat-2-7B and LLaMat-3-8B, obtained by continually pre-training Meta's LLaMA-2-7B and LLaMA-3-8B models, respectively, on a large corpus of 30B tokens of materials science text to improve their domain expertise. We also developed LLaMat-Chat models, instruction fine-tuned variants of the LLaMat models trained on a dataset of one million instruction-output pairs, enabling interaction and information extraction abilities for the materials science domain. We show that LLaMat achieves state-of-the-art performance on several information extraction tasks from materials science text, with LLaMat-3-8B emerging as the best model. We also demonstrate the structured information extraction capabilities of the developed chat models, comparing their performance on four datasets that cover named entity and relation extraction from text as well as understanding of composition tables in materials science research papers.
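As a rough illustration of the instruction fine-tuning setup described in the abstract, the sketch below serializes one instruction-output pair into a chat-style training string. The `[INST]` markers follow the common LLaMA-2 chat convention; the template, helper name, and example pair are assumptions for illustration, not the actual LLaMat dataset format.

```python
# Minimal sketch: wrapping an instruction-output pair in a chat-style
# template for instruction fine-tuning. The [INST]/[/INST] markers are
# the LLaMA-2 chat convention; the example pair below is illustrative,
# not taken from the one-million-pair LLaMat dataset.

def format_pair(instruction: str, output: str) -> str:
    """Serialize one instruction-output pair into a training string."""
    return f"<s>[INST] {instruction.strip()} [/INST] {output.strip()}</s>"

pair = {
    "instruction": "Extract the material mentioned in: "
                   "'The band gap of TiO2 was measured at 3.2 eV.'",
    "output": "TiO2",
}

sample = format_pair(pair["instruction"], pair["output"])
print(sample)
```

In practice, a corpus of such serialized strings would then be tokenized and used to fine-tune the continually pre-trained base model.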
Submission Number: 87
