ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Published: 01 Mar 2026, Last Modified: 24 Apr 2026
Venue: ICLR 2026 AIWILD
License: CC BY 4.0
Keywords: Research agents; agent evaluation; benchmarks; closed-loop scientific discovery; tool-using LLMs; autonomous experimentation; long-horizon reliability; reproducibility; containerized environments; evaluation harnesses; capability–reliability gap; agent scaffolding
TL;DR: A benchmark and execution environment for research agents to ideate, implement, and iteratively improve on real research tasks under objective evaluation. We find that LLM agents can autonomously outperform human researchers, but do so highly unreliably.
Abstract: We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate it, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but \emph{withhold} the paper's proposed method. This yields five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap. The agent improves over the baselines provided in the repositories in just 1 of 15 evaluations (6.7\%), where it improves by 11.5\%, and it completes only 26.5\% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits imposed by context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds, including Claude Code (Opus-4.5) and Codex (GPT-5.2), which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
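To make the success criterion concrete, the sketch below shows one way a task environment and its "improves over the baseline" check could be represented. This is a minimal, hypothetical illustration under our own naming assumptions (TaskSpec, beats_baseline, the example fields and scores); it is not the actual ResearchGym interface.

```python
# Hypothetical sketch of a ResearchGym-style task spec and success check.
# All names and values here are illustrative assumptions, not the real API.

from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    """One containerized task: a paper's repo minus its proposed method."""
    paper_id: str                      # source paper (e.g., an ICML spotlight)
    metric: str                        # metric from the paper's eval harness
    higher_is_better: bool             # direction of the metric
    baseline_score: float              # score of the provided baseline
    sub_tasks: list[str] = field(default_factory=list)

def beats_baseline(spec: TaskSpec, agent_score: float) -> bool:
    """An evaluation counts as an improvement iff the agent's score
    beats the repository baseline on the paper's own metric."""
    if spec.higher_is_better:
        return agent_score > spec.baseline_score
    return agent_score < spec.baseline_score

# Example: an agent narrowly beating a baseline of 0.81 accuracy.
task = TaskSpec("icml25-spotlight", "accuracy", True, 0.81, ["s1", "s2", "s3"])
print(beats_baseline(task, 0.903))  # True: ~11.5% relative improvement
```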
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 214