Keywords: Large Language Models, Environment Design, Benchmarking, Evaluation, Agents.
Abstract: The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks remain the primary tool for assessing model capabilities, but they quickly become saturated. In contrast, **dynamic benchmarks** evolve alongside the models they evaluate, yet they are expensive to create and maintain.
To address these challenges, we introduce **BeTaL (Benchmark Tuning with an LLM-in-the-loop)**, a framework that uses environment-design principles to **_automate dynamic benchmark construction_**. BeTaL parameterizes key design choices in base benchmark templates and leverages LLMs to reason over this parameter space to achieve desired properties such as difficulty and realism, all in a cost-efficient manner.
We validate BeTaL by targeting specific difficulty levels across tasks. Using this framework, we create two new benchmarks and extend the popular agentic benchmark **τ-Bench**. Extensive evaluation across the three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from **5.3% to 13.2%**—a **2–4× improvement** over baseline methods.
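For intuition, here is a minimal sketch of the kind of LLM-in-the-loop tuning loop the abstract describes: an LLM proposes values for the parameterized design choices of a base template, the resulting benchmark is evaluated, and the loop iterates until the measured difficulty is close to the target. All names, signatures, and the solve-rate-based difficulty measure below are illustrative assumptions, not BeTaL's actual implementation.

```python
# Hypothetical sketch of an LLM-in-the-loop benchmark-tuning loop.
# Names and signatures are illustrative assumptions, not BeTaL's API.
from dataclasses import dataclass


@dataclass
class TuningResult:
    params: dict        # chosen values for the parameterized design choices
    measured: float     # observed difficulty (here: 1 - agent solve rate)


def tune_benchmark(template, target_difficulty, propose_with_llm,
                   evaluate_agents, max_rounds=10, tolerance=0.05):
    """Iteratively ask an LLM to pick template parameters until the
    instantiated benchmark's measured difficulty is near the target."""
    history = []
    for _ in range(max_rounds):
        # 1. LLM reasons over the parameter space, conditioned on past attempts.
        params = propose_with_llm(template.parameter_space,
                                  target_difficulty, history)
        # 2. Instantiate a concrete benchmark from the base template.
        benchmark = template.instantiate(params)
        # 3. Measure difficulty by running evaluator agents on the tasks.
        solve_rate = evaluate_agents(benchmark)
        measured = 1.0 - solve_rate
        history.append(TuningResult(params, measured))
        # 4. Stop once the deviation from the target is small enough.
        if abs(measured - target_difficulty) <= tolerance:
            break
    # Return the configuration whose difficulty landed closest to the target.
    return min(history, key=lambda r: abs(r.measured - target_difficulty))
```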
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21837