Submission Track: Short Paper
Submission Category: AI-Guided Design
Keywords: information extraction, natural language processing, materials discovery
TL;DR: Extraction of compositions from the text of materials science articles using language models
Abstract: Materials composition, properties, processing, testing, and applications are important parts of the materials tetrahedron. Because composition reporting styles vary widely in the text of materials science research papers, extracting compositions is a challenging task. To address this challenge, we present an end-to-end extraction pipeline, which is essential for creating and completing materials science (MatSci) knowledge bases (KBs). The proposed approach involves automatically creating a training dataset using distant supervision and rule-based extraction. This dataset was used to train two models: one for identifying sentences that report a composition (which performed well) and one for extracting the composition (which performed poorly). To improve the performance of the extraction model, two steps were taken: first, generating additional training data using GPT-4, and second, classifying the composition reporting styles in the text. The resulting dataset was then used to train the FLAN-T5 language model to extract compositions from the text. We also compared the performance of our approach with GPT-4 and observed that the two perform comparably in cases where compositions are stated in the text in a simple, direct form. For cases where the composition is reported in the form of equations that require solving arithmetic expressions and substituting variable values, our proposed model achieves a 14.7% higher F1-score than GPT-4.
Submission Number: 17
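
To make the equation-style case from the abstract concrete, the sketch below shows what "solving arithmetic expressions and substitutions" can look like once a composition such as (1-x)A-xB has been recognized in a sentence. It is only an illustration under stated assumptions and is not the model or pipeline described in the submission: the function names `eval_coeff` and `expand_composition`, the template representation as (coefficient expression, compound) pairs, and the example compounds are all hypothetical.

```python
import ast
import operator

# Arithmetic operators allowed inside a coefficient expression.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_coeff(expr, variables):
    """Evaluate a small arithmetic expression such as '1-x' (hypothetical helper).

    Only numbers, variable names, + - * / and unary minus are accepted,
    so arbitrary code in `expr` is rejected rather than executed.
    """
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.Name):
            return float(variables[node.id])
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def expand_composition(template, variable, values):
    """Substitute each value of `variable` into the template.

    `template` is an assumed representation: a list of
    (coefficient expression, compound) pairs.
    Returns one {compound: fraction} dict per value.
    """
    return [
        {compound: round(eval_coeff(coeff, {variable: v}), 6)
         for coeff, compound in template}
        for v in values
    ]


# Illustrative input: "(1-x)TeO2-xWO3, where x = 0.1 and 0.2"
template = [("1-x", "TeO2"), ("x", "WO3")]
compositions = expand_composition(template, "x", [0.1, 0.2])
print(compositions)
```

Each returned dictionary corresponds to one concrete composition obtained by substituting a reported value of the variable, which is the kind of resolved output an extraction model would need to produce for equation-style reporting.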