AIE-Bench: Benchmarking Agents That Build Agents

Published: 25 May 2026, Last Modified: 25 May 2026 | CTB@ICML 2026 | Readers: Everyone | License: CC BY 4.0
Keywords: agent benchmarking, self-improvement, meta-improvement, agentic systems, scaffold optimization, language model agents, AI for AI, iterative modification, agent evaluation, recursive self-improvement
TL;DR: AIE Bench is the first benchmark for measuring whether an AI agent can improve another AI agent, filling a gap no existing benchmark addresses.
Abstract: We introduce AIE Bench, a benchmark for measuring how well AI agents can build and improve other AI agents. Existing benchmarks evaluate whether an agent can solve tasks; AIE Bench instead aims to measure whether an agent can modify another agent to make it better at those tasks. The benchmark is built around two roles: a meta-agent, which proposes modifications, and a target-agent, which is the agent being improved. This setup covers both meta-improvement, where one agent improves another, and self-improvement, where an agent improves itself. We instantiate AIE Bench across two task families, terminal interaction and tool calling, and we evaluate frontier agentic systems on their ability to drive gains through iterative modification. AIE Bench aims to make recursive agent improvement a measurable and reproducible research target.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 136