Keywords: information extraction, scientific charts, materials science
TL;DR: This paper explores the challenges and advancements in using transformer-based models to extract materials science data from scientific figures, introducing new benchmarks and fine-tuning techniques to improve performance.
Abstract: The rapid advancements in machine learning necessitate parallel improvements in the size and quality of domain-specific datasets, especially in fields like materials science, where such datasets are often lacking due to the unstructured nature of real-world information. Despite the wealth of knowledge generated in this domain, much of it remains underutilized because experimental data is often buried in charts. In this paper, we curate two new benchmarks and introduce Relative Coordinate-Label Similarity (RCLS), a novel metric for measuring the state of the art in extracting materials science data from scientific figures. We find that existing pretrained image-to-text Transformer-based models for chart-to-table translation struggle with the diverse and complex nature of materials science figures, leading to issues such as inconsistent extraction of axis labels, irregular presentation of tabular data, and the omission of critical elements such as legend labels. We further fine-tune the LLaMA 3.2-Vision 11B model to enhance its performance. Our study focuses on two subdomains of materials science, demonstrating both the successes and the ongoing challenges of using multimodal models to extract scientific chart data.
Submission Number: 29