Keywords: Coding Agents, Large Language Models, Agent Evaluation, Interactive Environment
TL;DR: We build USACOArena, a competitive programming arena to evaluate coding agents' decision-making skills under resource constraints, revealing strategic profiles that go beyond simple code correctness.
Abstract: Contemporary coding-agent benchmarks applaud the “first correct answer”, silently assuming infinite tokens, container minutes, and developer patience. In production, every LLM call, test re-run, and rollback incurs a hard cost; agents that cannot budget these resources are dead on arrival. We close this gap with USACOArena, an ICPC-inspired arena where agents pay deterministic credits for every prompt, compilation, test, or rollback. Each task becomes a cost–benefit negotiation under uncertainty: is a second sample worth 15% of the remaining budget, or should the agent pivot to a cheaper heuristic? Real-time credit deduction exposes decision profiles hidden from static leaderboards: the tax of over-specialized generators, the ROI of early-exit heuristics, and the compound interest of lightweight scaffolding. Even identically seeded agents diverge in self-play, revealing a rich policy space in which the same model oscillates between spendthrift submission sprees and parsimonious exploration. Released as a reproducible benchmark and zero-shot curriculum, USACOArena provides the traces, the credit engine, and decision logs from six state-of-the-art agents to catalyze research on coding agents that know when to stop.
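For concreteness, the sketch below illustrates the kind of deterministic, per-action credit accounting the abstract describes. The class names (CreditEngine, BudgetExhausted), action labels, and cost values are hypothetical stand-ins, not the released credit engine's interface.

```python
from dataclasses import dataclass, field

# Minimal sketch of deterministic credit accounting as described in the abstract.
# All names and cost values here are illustrative assumptions, not USACOArena's actual API.

COSTS = {"prompt": 10, "compile": 2, "test": 5, "rollback": 3}  # assumed per-action credit costs

class BudgetExhausted(Exception):
    """Raised when an agent attempts an action it can no longer afford."""

@dataclass
class CreditEngine:
    budget: int                               # remaining credits
    log: list = field(default_factory=list)   # decision trace for later analysis

    def charge(self, action: str) -> int:
        """Deduct the deterministic cost of `action` and return the remaining budget."""
        cost = COSTS[action]
        if cost > self.budget:
            raise BudgetExhausted(f"{action} costs {cost} credits, only {self.budget} left")
        self.budget -= cost
        self.log.append((action, cost, self.budget))
        return self.budget

# Example: an agent weighing a second sample against the remaining budget.
engine = CreditEngine(budget=100)
engine.charge("prompt")    # first LLM call
engine.charge("compile")
engine.charge("test")
if COSTS["prompt"] <= 0.15 * engine.budget:   # the "15% of the remaining budget" question
    engine.charge("prompt")                   # a second sample is affordable; take it
```

Under an accounting like this, the "second sample vs. cheaper heuristic" decision reduces to comparing a known action cost against the live remaining budget, which is the trade-off the arena forces agents to make.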
Submission Type: Research Paper (4-9 Pages)
Submission Number: 34