The Reliability Gap in Agentic Evidence Verification for Materials Science
Keywords: LLM, agents, materials science, benchmarking, datasets, scientific workflows, evaluation, superconductivity
Abstract: The explosive growth of scientific literature has created an urgent need for autonomous agents capable of rigorous evidence verification. However, existing evaluation frameworks largely rely on sanitized oracle settings (e.g., pre-parsed PDFs) or controlled knowledge environments (e.g., curated corpora), which abstract away the noise and complexity of real-world research. We develop a framework for evaluating agentic evidence verification under realistic conditions, grounding our study in a set of 200 open-access papers from the SuperCon database. We select materials science as our testbed because it demands exact numerical grounding and binary factual certainty, strict constraints that expose model reliability issues. Working with materials science experts, we design evaluation criteria for accuracy and reliability, focusing on two core workflows: Multi-Property Extraction from raw, multimodal PDFs, and Open-World Precedent Search on the live web, which explicitly tests the ability to verify "negative results". A systematic study of general-purpose agents reveals a sobering reality: models often identify relevant properties or materials but fail to reliably extract precise values or to ground their answers in valid attribution, highlighting a fundamental reliability gap in current systems.
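The two criteria named in the abstract, exact numerical grounding and attribution-validated correctness, suggest a simple joint scoring rule: a claim counts only if the extracted value matches expert ground truth and the cited evidence actually appears in the source. The sketch below is a minimal illustration of that rule, not the paper's released harness; the `Claim` record, `score_claim`, and the tolerance parameter are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """An extracted property claim together with its cited evidence."""
    material: str       # e.g. "YBa2Cu3O7"
    property_name: str  # e.g. "Tc"
    value: float        # extracted numerical value
    unit: str           # e.g. "K"
    source_quote: str   # verbatim span the agent cites as evidence

def score_claim(pred: Claim, gold: Claim, source_text: str,
                rel_tol: float = 1e-3) -> dict:
    """Score one claim on exact numerical grounding and valid attribution.

    The claim is correct only if the value matches the ground truth
    (within a small relative tolerance) AND the cited quote is actually
    present in the source document.
    """
    value_exact = (
        pred.unit == gold.unit
        and abs(pred.value - gold.value)
            <= rel_tol * max(abs(gold.value), 1e-12)
    )
    attribution_valid = pred.source_quote in source_text
    return {
        "value_exact": value_exact,
        "attribution_valid": attribution_valid,
        "correct": value_exact and attribution_valid,
    }
```

Under a rule like this, an agent that finds the right property but cites a quote absent from the paper scores zero, which is exactly the failure mode the abstract highlights.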
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 49