The Reliability Gap in Agentic Evidence Verification for Materials Science
Keywords: LLM, agents, materials science, benchmarking, datasets, scientific workflows, evaluation, superconductivity
Abstract: The explosive growth of scientific literature has created an urgent need for autonomous agents capable of rigorous evidence verification. However, existing evaluation frameworks largely rely on sanitized oracle settings (e.g., pre-parsed PDFs) or controlled knowledge environments (e.g., curated corpora), which abstract away the noise and complexity of real-world research. We develop a framework for evaluating agentic evidence verification under realistic conditions, grounding our study in a set of 200 open-access papers from the SuperCon database. We select materials science as our testbed because it demands exact numerical grounding and binary factual certainty, strict constraints that expose model reliability issues. Working with materials science experts, we design evaluation criteria for accuracy and reliability, focusing on two core workflows: Multi-Property Extraction from raw, multimodal PDFs, and Open-World Precedent Search on the live web, which explicitly tests the ability to verify "negative results". A systematic study of general-purpose agents reveals a sobering reality: models often identify relevant properties or materials but fail to reliably extract precise values or to ground their answers in valid attribution, highlighting a fundamental reliability gap in current systems.
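The two criteria named in the abstract, exact numerical grounding and attribution-validated correctness, suggest a simple joint scoring rule: a claim counts only if the extracted value matches expert ground truth and the cited evidence actually appears in the source. The sketch below is a minimal illustration of that rule, not the paper's released harness; the `Claim` record, `score_claim`, and the tolerance parameter are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """An extracted property claim together with its cited evidence."""
    material: str       # e.g. "YBa2Cu3O7"
    property_name: str  # e.g. "Tc"
    value: float        # extracted numerical value
    unit: str           # e.g. "K"
    source_quote: str   # verbatim span the agent cites as evidence

def score_claim(pred: Claim, gold: Claim, source_text: str,
                rel_tol: float = 1e-3) -> dict:
    """Score one claim on exact numerical grounding and valid attribution.

    The claim is correct only if the value matches the ground truth
    (within a small relative tolerance) AND the cited quote is actually
    present in the source document.
    """
    value_exact = (
        pred.unit == gold.unit
        and abs(pred.value - gold.value)
            <= rel_tol * max(abs(gold.value), 1e-12)
    )
    attribution_valid = pred.source_quote in source_text
    return {
        "value_exact": value_exact,
        "attribution_valid": attribution_valid,
        "correct": value_exact and attribution_valid,
    }
```

Under a rule like this, an agent that finds the right property but cites a quote absent from the paper scores zero, which is exactly the failure mode the abstract highlights.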
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 49