Copy-on-Write Scoring: Application-Specific Agent Evaluations

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
Readers: Everyone
License: CC BY 4.0
Keywords: LLM agents, evaluation, Copy-on-Write, benchmarking, software systems
TL;DR: We introduce Copy-on-Write Scoring, a framework enabling evaluation of LLM agents directly within their own applications by isolating agent writes from production data and producing session- and operation-level diagnostic scores.
Abstract: Trustworthy deployment of LLM-based agents in software systems requires evaluating how they perform on application-specific workflows, with enough granularity to localize where they succeed and fail. Yet existing agent evaluation mechanisms are limited: benchmarks have low construct validity for application-specific workflows and environments, and replica evaluation environments are expensive and prone to drift. We propose Copy-on-Write (CoW) Scoring, a framework that evaluates agents directly within software application environments, using a database-level Copy-on-Write mechanism to isolate and evaluate agent writes. CoW Scoring produces session- and operation-level scores that highlight where agents succeed and fail in a given application environment, enabling inexpensive evaluation and iteration on agent harnesses and tool surfaces. We demonstrate the framework on Plane, an open-source project-management platform, where analysis surfaced specific issues in the tool surface, and corresponding fixes produced measurable improvements on affected models. Library code (Python library agent-cow): https://anonymous.4open.science/r/agent-cow-python-F219
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 153