Testing LLM Understanding of Scientific Literature through Expert-Driven Question Answering: Insights from High-Temperature Superconductivity
Track: AI for Science
Keywords: LLMs, RAG, scientific literature, QA, superconductivity, HCI
TL;DR: Evaluating whether LLMs can leverage prior knowledge and genuinely understand scientific literature in high-temperature superconductivity
Abstract: Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers grounded in experimental results within specialized domains remains an active area of research. This work evaluates the performance of six LLM-based systems for answering scientific queries, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of the systems on high-temperature cuprate superconductors, a research area spanning materials science, experimental physics, computation, and theoretical physics, using a set of 67 expert-formulated queries. In particular, we compare responses from models that answer queries based on their training data and web search against models that answer from expert-curated literature: a database of 1,726 scientific papers. We use a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. We discuss the promising aspects of LLM performance as well as the models' critical shortcomings. This study provides valuable insights into designing and evaluating specialized scientific literature understanding systems, particularly with expert involvement, while also highlighting the importance of rich, domain-specific data in such systems.
Serve As Reviewer: ~Subhashini_Venugopalan2, ~Haoyu_Guo5
Submission Number: 21