Keywords: Coding Agents, Large Language Models, Agent Evaluation, Interactive Environment
TL;DR: We build USACOArena, a competitive programming arena to evaluate coding agents' decision-making skills under resource constraints, revealing strategic profiles that go beyond simple code correctness.
Abstract: As autonomous agents and agent swarms become increasingly capable at complex coding tasks, the focus must shift from accuracy alone to real-world efficiency. Unconstrained agents often waste substantial resources and time, incurring high opportunity costs. To be practical, agents must manage a generalized "credit" budget covering generated tokens, local tests, and elapsed time. To evaluate this resource awareness, we introduce USACOArena, an interactive ACM-ICPC-style arena where every decision consumes credits from a fixed pool. This transforms a static coding benchmark into a dynamic resource management challenge. Our experiments reveal that frontier models, including the Codex framework and leading single agents, struggle to balance these economic constraints optimally. Through competitive profiling and self-play, we uncover distinct, path-dependent decision strategies. Ultimately, enforcing a strict credit economy is essential for developing highly efficient, cost-aware agent swarms.
Primary Area: datasets and benchmarks
Submission Number: 6159
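The credit economy described in the abstract can be illustrated with a minimal sketch. This is not the USACOArena implementation: the class name `CreditLedger`, the integer per-action rates, and the rule that an unaffordable action is simply refused are all assumptions chosen for illustration. Only the idea of a fixed pool charged for generated tokens, local tests, and elapsed time comes from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """Hypothetical fixed credit pool; every rate here is an illustrative assumption."""
    budget: int
    cost_per_1k_tokens: int = 1   # assumed rate, not taken from the paper
    cost_per_test: int = 2        # assumed rate
    cost_per_second: int = 1      # assumed rate
    spent: int = 0
    log: list = field(default_factory=list)

    @property
    def remaining(self) -> int:
        return self.budget - self.spent

    def charge(self, action: str, tokens: int = 0, tests: int = 0, seconds: int = 0) -> bool:
        """Deduct the cost of one action; refuse it if the pool cannot cover it."""
        token_blocks = -(-tokens // 1000)  # ceiling division: partial blocks are billed in full
        cost = (token_blocks * self.cost_per_1k_tokens
                + tests * self.cost_per_test
                + seconds * self.cost_per_second)
        if cost > self.remaining:
            return False
        self.spent += cost
        self.log.append((action, cost))
        return True


ledger = CreditLedger(budget=10)
ledger.charge("generate", tokens=2500)          # 3 token blocks at 1 credit each
ledger.charge("local_test", tests=2, seconds=1)  # 2 tests plus 1 second
```

Under these assumed rates, an agent that spends most of its pool on early generation and testing cannot afford a later, larger generation step, which is the kind of trade-off the arena is meant to expose.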