SciPro Arena: a Case Study of AI Agent Capabilities in Scientific Analysis Tasks

ICLR 2026 Conference Submission 8271 Authors

17 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Science, Physics, Evaluation, Large Language Models, Benchmark, Agent
TL;DR: We evaluate whether frontier agentic models can perform scientific data analysis to a professional standard.
Abstract: We introduce SciPro (Scientific Process) Arena, a benchmark that measures how well frontier agentic AI systems analyze scientific data. While current benchmarks test knowledge recall and reasoning, few evaluate whether models can extract patterns from noisy experimental datasets. Using simulated condensed matter data, we prompt models to identify structures from 2D intensity arrays in the presence of noise and limited instrumental resolution, scoring them so that a score of 1 indicates complete accuracy and 0.5 denotes a deviation equivalent to typical experimental error. We test recent reasoning models. The best models score only 0.13 out of 1 averaged across all questions, rising to 0.20 for noiseless questions. This pales in comparison to human-written code, which scores 0.37 (averaged) and 0.55 (noiseless). We identify clear failure modes: models can find simple features but struggle to trace continuous patterns or compute derived quantities. Performance degrades predictably with increased data resolution and noise levels. These results show that current frontier models cannot reliably perform scientific data analysis, highlighting a significant gap between current capabilities and practical use in reinforcement learning-based agents for scientific discovery.
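The abstract specifies only two anchor points of the scoring rule (1 for exact agreement, 0.5 at a deviation of one typical experimental error), not its functional form. The sketch below is a hypothetical illustration of one rule consistent with those anchors, an exponential decay whose half-life equals the experimental error; the function name, parameters, and form are assumptions, not the authors' implementation.

```python
import numpy as np

def score_prediction(predicted, true, sigma_exp):
    """Hypothetical scoring rule matching the abstract's anchors:
    deviation 0 -> score 1, deviation sigma_exp -> score 0.5.
    Implemented as exponential decay with half-life sigma_exp."""
    deviation = np.abs(predicted - true)
    return 0.5 ** (deviation / sigma_exp)

# Example: a prediction off by half the typical experimental error
print(score_prediction(predicted=1.05, true=1.00, sigma_exp=0.10))  # ~0.71
```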
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8271