Keywords: agent benchmarks, verifiable rewards, enterprise workflows, constraint optimization, ERP systems
TL;DR: Using constraint solvers to generate verifiable agent training and evaluation environments for real-world enterprise workflows
Abstract: AI agents are beginning to complete valuable, long-horizon business operations tasks,
but training and evaluation environments for enterprise work still struggle to balance realism,
verifiability, and scale. Environment and task creation often suffers from a failure mode we call artifact drift: when
instructions, environments, oracles, and verifiers are created by loosely coupled processes, they
frequently disagree on what a task requires, producing environments that are
unsolvable, reward-hackable, or inconsistent. We introduce Anchor,
a task-generation pipeline that formalizes domain experts' specifications of business
workflows into constraint optimization programs. From a single parametric
specification, the pipeline jointly produces a natural-language instruction, environment
configuration, solver-certified ground-truth solution, and state-based verifier. With
Anchor, altering parameters yields new tasks with controlled difficulty and known
optimal solutions, producing harness-agnostic environments whose rewards depend solely
on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of
300 long-horizon tasks spanning procurement and manufacturing workflows in a
production-grade ERP system. We find that generation
parameters predict realized difficulty, and that frontier models satisfy explicit task
constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of
trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building
auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai.
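The idea of deriving every task artifact from one parametric specification can be illustrated with a minimal sketch. This is not Anchor's implementation: the `Spec` class, the toy procurement problem, and all function names are hypothetical, and an exhaustive search stands in for a real constraint solver. What it shows is that the instruction, environment configuration, ground-truth solution, and state-based verifier all read from the same specification, so they cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Spec:
    """Hypothetical parametric specification of a toy procurement task."""
    demand: int        # units that must be ordered in total
    unit_cost: tuple   # cost per unit, one entry per supplier
    capacity: tuple    # maximum units, one entry per supplier


def total_cost(spec, order):
    return sum(q * c for q, c in zip(order, spec.unit_cost))


def is_feasible(spec, order):
    within_capacity = all(0 <= q <= cap for q, cap in zip(order, spec.capacity))
    return within_capacity and sum(order) == spec.demand


def solve(spec):
    # Assumption: brute force replaces a real constraint solver here.
    # Raises ValueError if the specification admits no feasible order.
    candidates = product(*(range(cap + 1) for cap in spec.capacity))
    feasible = [o for o in candidates if is_feasible(spec, o)]
    return min(feasible, key=lambda o: total_cost(spec, o))


def generate_task(spec):
    """Produce all four artifacts from the same specification."""
    oracle = solve(spec)
    best_cost = total_cost(spec, oracle)

    instruction = (
        f"Order exactly {spec.demand} units across the available suppliers "
        "at the lowest total cost without exceeding any supplier's capacity."
    )
    environment = {
        "suppliers": [
            {"id": i, "unit_cost": c, "capacity": cap}
            for i, (c, cap) in enumerate(zip(spec.unit_cost, spec.capacity))
        ]
    }

    def verify(final_order):
        # State-based check: looks only at the end state, not at how
        # the agent got there.
        order = tuple(final_order)
        feasible = is_feasible(spec, order)
        optimal = feasible and total_cost(spec, order) == best_cost
        return {"feasible": feasible, "optimal": optimal}

    return {
        "instruction": instruction,
        "environment": environment,
        "oracle": oracle,
        "verify": verify,
    }


task = generate_task(Spec(demand=10, unit_cost=(3, 5), capacity=(6, 8)))
print(task["oracle"])            # (6, 4)
print(task["verify"]((5, 5)))    # {'feasible': True, 'optimal': False}
```

Changing `demand`, `unit_cost`, or `capacity` yields a new task whose optimal solution is known by construction. The verifier's two flags mirror the distinction the abstract draws between satisfying explicit constraints and reaching a fully optimal solution.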
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 10