Keywords: agentic verification, agents-as-a-judge, LLM-as-a-judge, rubric-based evaluation, agent benchmarks, verifiers
Abstract: As AI agents take on open-ended tasks, verifying their outputs is increasingly difficult. *Rubric-graded agent environments*, in which a verifier judges multiple natural-language criteria against the agent's deliverables and environment state, have emerged as a popular paradigm. We conduct the first systematic study of verifiers for such environments. First, we create BankerVerifierBench (BVB), a meta-evaluation dataset of $3{,}204$ human-judged criteria across $21$ investment-banking tasks. Next, we derive verifier requirements directly from the rubric corpus, yielding a nine-capability taxonomy that we distill into three design principles (reactive verification, environment alignment, and domain guidance), which we implement in Gandalf, an open-source verifier. Finally, we evaluate Gandalf on BVB: its cheapest configuration (F1 $0.633$, 42 USD) is Pareto optimal, exceeding the most expensive baseline (F1 $0.538$, 414 USD) by $9.5$ F1 points at roughly one-tenth the cost.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 18