SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints

ICLR 2026 Conference Submission 13390 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Language Agent, Tool Call, Function Calling, Standard Operating Procedures, Instruction Following, Constraint Following, Jailbreak, SOP, Sandbox Environment, Automatic Data Generation, Automatic Evaluation
TL;DR: We introduce an automated evaluation pipeline producing a benchmark covering 7 domains with over 900 cases that evaluate language agents' adherence to standard operating procedures and constraints, which proves challenging even for leading proprietary models.
Abstract: As language agents increasingly automate critical tasks, their ability to follow domain-specific standard operating procedures (SOPs), policies, and constraints when taking actions and making tool calls becomes essential, yet remains underexplored. To address this gap, we develop an automated evaluation pipeline with: (1) sandbox environments containing 167 executable tools/functions across seven customer service domains with 70 service-specific, verifiable SOPs and constraints, (2) an automated test generation framework producing over 800 verified test cases, and (3) an evaluation harness to rigorously assess agent adherence. Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions correctly based on natural-language SOP descriptions. The SOP code serves as an oracle verifier that assesses compliance along multiple dimensions, reducing reliance on manual or LLM-based evaluations. We evaluate 18 leading models and find that the task remains challenging even for top-tier reasoning models such as o4-mini-high, with pass rates around 30% on certain difficult domains. Powerful non-reasoning models perform worse than reasoning models, and smaller models (<32B) show limited capability. Additionally, language agents can easily be jailbroken into overlooking SOPs and constraints. Code, data, and over 24k agent trajectories are released.
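To make the abstract's verification idea concrete, here is a minimal, hypothetical sketch of an SOP represented as a directed graph of tool dependencies, with an oracle-style check over an agent's tool-call trace. This is not the paper's released code; SOPGraph, add_dependency, verify_trace, and the tool names are illustrative assumptions only.

```python
# Minimal sketch (assumed names, not the authors' API): one SOP as a directed
# graph whose nodes are tools and whose edges encode "must be called before"
# dependencies derived from the SOP program.
from dataclasses import dataclass, field


@dataclass
class SOPGraph:
    # tool name -> set of tools that must already have been called
    edges: dict[str, set[str]] = field(default_factory=dict)

    def add_dependency(self, tool: str, prerequisite: str) -> None:
        """Record that `prerequisite` must be called before `tool`."""
        self.edges.setdefault(tool, set()).add(prerequisite)
        self.edges.setdefault(prerequisite, set())

    def verify_trace(self, trace: list[str]) -> bool:
        """Oracle-style compliance check: every call in the agent's trace may
        appear only after all of its prerequisites, so no manual or LLM-based
        judging is needed for this dimension of adherence."""
        seen: set[str] = set()
        for tool in trace:
            if not self.edges.get(tool, set()) <= seen:
                return False  # a required step was skipped or reordered
            seen.add(tool)
        return True


# Hypothetical refund SOP: identity verification and order lookup must
# precede issuing a refund.
sop = SOPGraph()
sop.add_dependency("issue_refund", "verify_identity")
sop.add_dependency("issue_refund", "lookup_order")

print(sop.verify_trace(["verify_identity", "lookup_order", "issue_refund"]))  # True
print(sop.verify_trace(["issue_refund"]))                                     # False
```

A real verifier in this setting would also check call arguments and environment state, but even this graph view shows how an SOP program can serve as an executable oracle rather than a rubric for human or LLM grading.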
Primary Area: datasets and benchmarks
Submission Number: 13390