Automating Benchmark Design

ICLR 2026 Conference Submission 21837 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Environment Design, Benchmarking, Evaluation, Agents.
Abstract: The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but they quickly become saturated. In contrast, \emph{dynamic benchmarks} evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop \method (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to \textbf{\emph{automate the process of dynamic benchmark design}}. \method parametrizes key design choices in base benchmark templates and uses LLMs to reason through the parameter space to achieve target properties (such as difficulty and realism) in a cost-efficient manner. We use our approach to generate a new and challenging spatial reasoning benchmark and to develop new tasks for popular agentic benchmarks such as $\tau$-bench. We carry out extensive experiments on three datasets, at different target performance (difficulty) levels, and show that \method achieves the lowest gap to the target performance, as low as 0.4\% and at most 5\% in most settings, significantly improving over competing LLM-based and non-LLM baselines. These experiments demonstrate that \method opens the door to a new paradigm of self-adaptive, continually improving evaluation systems.
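The abstract describes a tuning loop in which the design parameters of a benchmark template are adjusted, with an LLM proposing new settings, until the evaluated model's performance lands near a target level. The Python sketch below only illustrates that idea under assumed interfaces; `BenchmarkTemplate`, `llm_propose_params`, and `evaluate_model` are hypothetical names, not the paper's actual API.

```python
# Minimal sketch of an LLM-in-the-loop benchmark-tuning loop.
# All interfaces here are assumptions for illustration, not the paper's method.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BenchmarkTemplate:
    """A base benchmark whose difficulty-relevant design choices are parametrized."""
    params: Dict[str, float]

    def instantiate(self) -> "BenchmarkTemplate":
        # In practice this would generate concrete tasks from the parameters.
        return self


def tune_benchmark(
    template: BenchmarkTemplate,
    target_accuracy: float,
    llm_propose_params: Callable[[Dict[str, float], float, float], Dict[str, float]],
    evaluate_model: Callable[[BenchmarkTemplate], float],
    max_iters: int = 10,
    tolerance: float = 0.01,
) -> BenchmarkTemplate:
    """Iteratively adjust template parameters until model accuracy is near the target."""
    for _ in range(max_iters):
        benchmark = template.instantiate()
        accuracy = evaluate_model(benchmark)     # measured performance of the model under test
        gap = accuracy - target_accuracy         # signed gap to the target difficulty level
        if abs(gap) <= tolerance:
            break
        # The LLM reasons over the current parameters and the observed gap
        # to propose a new point in the design space.
        template.params = llm_propose_params(template.params, accuracy, target_accuracy)
    return template
```

In this framing, the reported performance gap would correspond to `abs(gap)` at termination; the choice of which design parameters to expose and how the LLM proposer trades off difficulty against realism is where the actual framework's contribution lies.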
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21837