AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Talor Abramovich; Gal Chechik

Published: 30 May 2026, Last Modified: 01 Jun 2026, Venue: ICML2026-AI4Science Poster, License: CC BY 4.0

Track: Track 1: Original Research/Position/Education/Attention Track

Keywords: benchmark, ablation, AI coscientist, language models, agents

TL;DR: We introduce AblationBench, a benchmark suite of two tasks for evaluating LMs on ablation planning in empirical AI research.

Abstract: Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. Ablation experiments are a key mechanism for obtaining such insights. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging: the best-performing LM system identifies only 45% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks and find that chain-of-thought prompting outperforms an agent-based approach.

Submission Number: 10