Keywords: Science, Physics, Evaluation, Large Language Models, Benchmark, Agent
TL;DR: The ability of frontier agentic models to perform scientific data analysis at a professional standard is evaluated
Abstract: We introduce SciPro (Scientific Process) Arena, a benchmark that measures how well frontier agentic AI systems analyze scientific data. While current benchmarks test knowledge recall and reasoning, few evaluate whether models can extract patterns from noisy experimental datasets. Using simulated condensed matter data, we prompt models to identify structures from 2D intensity arrays in the presence of noise and limited instrumental resolution, scoring them so that 1 indicates complete accuracy and 0.5 indicates a deviation comparable to typical experimental error. We test recent reasoning models. The best models score only 0.13 out of 1 averaged across all questions, rising to 0.20 for questions without noise. This pales in comparison to human-written code, which scores 0.37 (averaged) and 0.55 (noiseless). We identify clear failure modes: models can find simple features, but struggle to trace continuous patterns or compute derived quantities. Performance degrades predictably with increased data resolution and noise levels. These results show that current frontier models cannot reliably perform scientific data analysis, highlighting a significant gap between current capabilities and practical use in reinforcement-learning-based agents for scientific discovery.
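The abstract anchors the scoring scale at two points (1 for an exact answer, 0.5 at a deviation equal to typical experimental error) without giving the full scoring rule. Below is a minimal illustrative sketch of one curve consistent with those two anchors; the exponential form, function name, and parameters are assumptions, not the benchmark's actual metric.

```python
import numpy as np

def score_prediction(predicted: float, true: float, typical_error: float) -> float:
    """Illustrative scoring curve (assumed, not the benchmark's exact rule):
    returns 1.0 for an exact answer, 0.5 when the absolute deviation equals
    the typical experimental error, and decays smoothly toward 0 beyond that."""
    deviation = np.abs(predicted - true)
    return float(0.5 ** (deviation / typical_error))

# Example: a prediction off by exactly one typical experimental error scores 0.5
print(score_prediction(predicted=2.1, true=2.0, typical_error=0.1))  # -> 0.5
```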
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8271