Verifying Agents in Rubric-Graded Environments

Markus Dücker; Vaibhav Kumar; Yi Liu; Ronak Chaudhary; Andreas Plesner; Francisco Guzmán; Anish Athalye

Published: 23 May 2026, Last Modified: 26 May 2026. Venue: ACM CAIS 2026: RLEval Workshop Oral. License: CC BY 4.0

Keywords: agentic verification, agents-as-a-judge, LLM-as-a-judge, rubric-based evaluation, agent benchmarks, verifiers

Abstract: As AI agents take on open-ended tasks, verifying their outputs is increasingly difficult. *Rubric-graded agent environments*, where a verifier judges multiple natural-language criteria against the agent's deliverables and environment state, have emerged as a popular paradigm. We conduct the first systematic study of verifiers for such environments. First, we create BankerVerifierBench (BVB), a meta-evaluation dataset of $3{,}204$ human-judged criteria across $21$ investment-banking tasks. Next, we derive verifier requirements directly from the rubric corpus, yielding a nine-capability taxonomy that we distill into three design principles (reactive verification, environment alignment, and domain guidance), which we implement in Gandalf, an open-source verifier. Finally, we evaluate Gandalf on BVB: its cheapest configuration (F1 $0.633$, 42 USD) is Pareto optimal, exceeding the most expensive baseline (F1 $0.538$, 414 USD) by $9.5$ F1 points at roughly one-tenth the cost.

Submission Number: 18
