Testing LLM Understanding of Scientific Literature through Expert-Driven Question Answering: Insights from High-Temperature Superconductivity
Track: AI for Science
Keywords: LLMs, RAG, scientific literature, QA, superconductivity, HCI
TL;DR: Evaluating whether LLMs can leverage prior knowledge and genuinely understand scientific literature in high-temperature superconductivity
Abstract: Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers grounded in experimental results within specialized domains remains an active area of research. This work evaluates the performance of six LLM-based systems for answering scientific queries, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of the systems on high-temperature cuprate superconductors, a research area spanning materials science, experimental physics, computation, and theoretical physics, using a set of 67 expert-formulated queries. In particular, we compare responses from models that answer queries based on their training data and web search against models that answer from expert-curated literature: a database of 1,726 scientific papers. We use a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. We discuss the promising aspects of LLM performance as well as the models' critical shortcomings. This study provides valuable insights into designing and evaluating specialized scientific literature understanding systems, particularly with expert involvement, while also highlighting the importance of rich, domain-specific data in such systems.
Serve As Reviewer: ~Subhashini_Venugopalan2, ~Haoyu_Guo5
Submission Number: 21