SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints

ICLR 2026 Conference Submission 13390 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Language Agent, Tool Call, Function Calling, Standard Operating Procedures, Instruction Following, Constraint Following, Jailbreak, SOP, Sandbox Environment, Automatic Data Generation, Automatic Evaluation
TL;DR: We introduce an automated evaluation pipeline producing a benchmark covering 7 domains with over 900 cases that evaluate language agents' adherence to standard operating procedures and constraints, which proves challenging even for leading proprietary models.
Abstract: As language agents increasingly automate critical tasks, their ability to follow domain-specific standard operating procedures (SOPs), policies, and constraints when taking actions and making tool calls becomes essential, yet remains underexplored. To address this gap, we develop an automated evaluation pipeline with: (1) sandbox environments containing 167 executable tools/functions across seven customer service domains with 70 service-specific, verifiable SOPs and constraints, (2) an automated test generation framework producing over 800 verified test cases, and (3) an evaluation harness to rigorously assess agent adherence. Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions correctly based on natural-language SOP descriptions. The SOP code serves as an oracle verifier that assesses compliance along multiple dimensions, reducing reliance on manual or LLM-based evaluations. We evaluate 18 leading models and find that the task remains challenging even for top-tier reasoning models such as o4-mini-high, with pass rates around 30% on certain difficult domains. Powerful non-reasoning models perform worse than reasoning models, and smaller models (<32B) show limited capability. Additionally, language agents can easily be jailbroken into overlooking SOPs and constraints. Code, data, and over 24k agent trajectories are released.
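To make the abstract's verification idea concrete, here is a minimal, hypothetical sketch of an SOP represented as a directed graph of tool dependencies, with an oracle-style check over an agent's tool-call trace. This is not the paper's released code; SOPGraph, add_dependency, verify_trace, and the tool names are illustrative assumptions only.

```python
# Minimal sketch (assumed names, not the authors' API): one SOP as a directed
# graph whose nodes are tools and whose edges encode "must be called before"
# dependencies derived from the SOP program.
from dataclasses import dataclass, field


@dataclass
class SOPGraph:
    # tool name -> set of tools that must already have been called
    edges: dict[str, set[str]] = field(default_factory=dict)

    def add_dependency(self, tool: str, prerequisite: str) -> None:
        """Record that `prerequisite` must be called before `tool`."""
        self.edges.setdefault(tool, set()).add(prerequisite)
        self.edges.setdefault(prerequisite, set())

    def verify_trace(self, trace: list[str]) -> bool:
        """Oracle-style compliance check: every call in the agent's trace may
        appear only after all of its prerequisites, so no manual or LLM-based
        judging is needed for this dimension of adherence."""
        seen: set[str] = set()
        for tool in trace:
            if not self.edges.get(tool, set()) <= seen:
                return False  # a required step was skipped or reordered
            seen.add(tool)
        return True


# Hypothetical refund SOP: identity verification and order lookup must
# precede issuing a refund.
sop = SOPGraph()
sop.add_dependency("issue_refund", "verify_identity")
sop.add_dependency("issue_refund", "lookup_order")

print(sop.verify_trace(["verify_identity", "lookup_order", "issue_refund"]))  # True
print(sop.verify_trace(["issue_refund"]))                                     # False
```

A real verifier in this setting would also check call arguments and environment state, but even this graph view shows how an SOP program can serve as an executable oracle rather than a rubric for human or LLM grading.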
Primary Area: datasets and benchmarks
Submission Number: 13390