FIRE-Bench: Evaluating Research Agents on the Rediscovery of Scientific Insights

ICLR 2026 Conference Submission 21286 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Research Agents, Autonomous AI Agents, Research Automation, Benchmarking, AI for Science, Scientific Discovery
Abstract: Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery, but rigorously evaluating their capacity for genuine discovery remains a critical challenge. Current evaluation benchmarks face a dilemma: they either rely on LLM-as-judge evaluations of auto-generated papers, which raise concerns about validity and circularity, or focus on optimizing single performance metrics that serve as a coarse proxy for genuine discovery. To address this, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation). Our benchmark reframes evaluation by tasking agents with the verifiable rediscovery of established scientific findings from recent, high-impact ML research. We provide agents only with the high-level research question from a published study, requiring them to autonomously design experiments, implement code, execute their plan, and derive a conclusion from the evidence. We evaluate a suite of state-of-the-art agents with frontier model backbones (e.g., GPT-5) on FIRE-Bench. Our findings paint a sobering picture of current capabilities: even the most advanced agents struggle profoundly, exhibiting low success rates, high variance, and a spectrum of recurring failure modes ranging from flawed experimental design to ungrounded conclusions. FIRE-Bench provides a rigorous, diagnostic framework for measuring and driving progress towards AI agents capable of genuine scientific discovery.
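To make the setup described in the abstract concrete, below is a minimal, purely hypothetical sketch of what a rediscovery task record and a verification check might look like. The field names (`research_question`, `reference_finding`) and the keyword-overlap criterion are illustrative assumptions, not the benchmark's actual schema or grading protocol.

```python
# Hypothetical sketch of a FIRE-Bench-style task record and scoring check.
# All names and the match criterion below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class RediscoveryTask:
    """One rediscovery task: the agent is shown only `research_question`."""
    task_id: str
    research_question: str   # high-level question taken from the source paper
    reference_finding: str   # held-out ground-truth insight, hidden from the agent


def matches_reference(agent_conclusion: str, task: RediscoveryTask) -> bool:
    """Toy stand-in for verification: a real protocol would use a rubric or
    expert/judge comparison against the reference finding."""
    keywords = {w.lower().strip(".,") for w in task.reference_finding.split()}
    hits = sum(1 for w in agent_conclusion.lower().split() if w.strip(".,") in keywords)
    return hits >= max(3, len(keywords) // 4)


if __name__ == "__main__":
    task = RediscoveryTask(
        task_id="example-001",
        research_question="Does chain-of-thought prompting improve accuracy on arithmetic word problems?",
        reference_finding="Chain-of-thought prompting substantially improves accuracy on multi-step arithmetic problems.",
    )
    print(matches_reference("We find chain-of-thought prompting improves arithmetic accuracy.", task))
```

This sketch only illustrates the input/output contract implied by the abstract (question in, conclusion out, conclusion checked against a known finding); the paper's actual evaluation pipeline may differ substantially.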
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21286