Keywords: large language models, information extraction, materials data
TL;DR: Multimodal large language models show promise for extracting composition and property data from the materials science literature on polymer composites.
Abstract: Advances in materials science depend on leveraging data from the vast published literature. Extracting detailed data and metadata from these publications is challenging, leading current data repositories to rely on newly created data in narrow domains. Large Language Models (LLMs) offer a new opportunity to rapidly and accurately extract data and insights from the published literature, transforming it into structured formats for easy querying and reuse. This paper explores using LLMs for autonomous data extraction from materials science articles, focusing on polymer composites to demonstrate successes and challenges in extracting tabular data. We explored different table representations for use with LLMs and found that a multimodal model with image input yielded the most promising results. This model achieved an accuracy score of 0.910 for composition information extraction, which includes polymer names, molecule names used as fillers, and their respective compositions. It also achieved an F$_1$ score of 0.863 for property name extraction. Under the most conservative evaluation of property extraction, which requires an exact match in all details, we obtained an F$_1$ score of 0.419. Allowing varying degrees of flexibility in the evaluation raised this score to as much as 0.769. We envision that the results and analysis from this study will promote further research on information extraction strategies for materials information sources.
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 17