SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks

17 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Science, Physics, Evaluation, Large Language Models, Benchmark, Agent
TL;DR: The ability of frontier agentic models to perform scientific data analysis at a professional standard was evaluated.
Abstract: In the physical sciences, stringent standards of error tolerance mean that a data analysis is either rigorous enough to be trusted by other scientists or is disregarded entirely. We introduce SciPro (Scientific Process) Arena, a benchmark that measures how reliably frontier AI systems analyze spectral scientific data. Using simulated condensed matter physics data, we prompt models to identify structures in 2D intensity arrays in the presence of noise and limited instrumental resolution. Responses are scored such that 1 indicates complete accuracy and 0.5 denotes a deviation equivalent to typical experimental error. We test recent reasoning models. With direct querying, the best models score only 0.13 out of 1 averaged across all questions, rising to 0.20 on questions without noise. Enabling access to a Python interpreter improves scores to as much as 0.23 (averaged) and 0.39 (noiseless), but these still fall well short of human-written code, which scores 0.37 and 0.55 respectively. We identify clear failure modes: models can find simple features but struggle to trace continuous patterns or compute derived quantities. Performance predictably degrades with increased data resolution and noise levels. These results show that current frontier models cannot reliably perform scientific data analysis, highlighting a significant gap between current capabilities and practical uses of LLMs for scientific discovery in physics.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8271