Keywords: agent skills, LLM agents, evaluation, LLM-as-Judge, Skill Lift, multi-agent evaluation, continuous evaluation, CI/CD, procedural knowledge
TL;DR: ACES evaluates agent skills with paired live-agent trials, measuring whether a skill improves runtime behavior beyond scan-only review.
Abstract: Agent skills extend LLM agents through natural-language instructions, scripts, examples, and references that agents load on demand. Current repository review gates often scan these artifacts with structural checks, LLM-as-Judge rubrics, script linting, and security rules. Such gates are useful, but they do not answer the runtime question: does a live agent discover the skill, use it correctly, and perform better than it would without that skill?
We present ACES (Agentic Continuous Evaluation of Skills), a repository-native, trajectory-aware framework for evaluating skills as executable agent artifacts. ACES runs paired live trials for each task: a with-skill condition and a matched baseline in which the target skill is withheld while configured support and decoy skills remain fixed. The resulting trajectories are normalized into the Agent Trajectory Interchange Format (ATIF) and graded by six default metrics: security, skill execution, skill efficiency, accuracy, goal accuracy, and behavior check. Their paired deltas yield Skill Lift, an estimate of the target skill's added value for a fixed task, harness, workspace, and grading policy.
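As a minimal sketch of the paired-delta idea, the snippet below computes a per-metric delta and one overall lift for a single task. The abstract does not state how the six deltas are combined, so the unweighted mean used here is an assumption. The snake_case metric keys, the [0, 1] score range, and the function names are also hypothetical.

```python
from statistics import mean

# Hypothetical keys for the six default metrics named in the abstract.
METRICS = (
    "security",
    "skill_execution",
    "skill_efficiency",
    "accuracy",
    "goal_accuracy",
    "behavior_check",
)


def paired_deltas(with_skill: dict, baseline: dict) -> dict:
    """Per-metric delta: with-skill score minus matched-baseline score."""
    return {m: with_skill[m] - baseline[m] for m in METRICS}


def skill_lift(with_skill: dict, baseline: dict) -> float:
    """Overall lift for one paired task case.

    ASSUMPTION: an unweighted mean of the per-metric deltas; the actual
    ACES aggregation may weight or combine metrics differently.
    """
    return mean(paired_deltas(with_skill, baseline).values())


# Illustrative scores only (assumed to lie in [0, 1]); not data from the paper.
with_scores = {m: 1.0 for m in METRICS}
base_scores = {m: 0.5 for m in METRICS}
lift = skill_lift(with_scores, base_scores)
```

Because both conditions share the task, harness, workspace, and grading policy, the delta is attributed to the withheld target skill and not to differences in setup.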
On 145 real skills from internal enterprise repositories and public skill catalogs, scan-only gates surface real authoring issues but measure complementary facets: structural and LLM-judge scores correlate only weakly, at Spearman $\rho=0.14$. Across the live-trial surface, ACES exercised 89 unique skill variants. For the main production-skill analysis, we exclude scaling stress variants and report 947 paired task cases from 64 production skills and four primary harnesses. Mean overall Skill Lift is 0.2134 (95% paired-case CI [0.1967, 0.2301]), with positive lift in 72.8% of paired cases; the largest mean gains appear in skill execution, behavior check, and skill efficiency, which are runtime-process signals that document scans cannot observe. ACES also makes evaluation assets first-class: author-owned datasets, trajectory-grounded refinement, expected_behavior checks, BYOT tasks, BYOG graders, and CI reporting. We additionally describe a product-first operating mode that applies the same paired protocol to bundles, teams, and plugins while reserving Skill Lift for skill-centered comparisons.
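The reported summary statistics (mean lift, a 95% interval over paired cases, and the share of cases with positive lift) could be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say how the interval is constructed, so a normal-approximation interval is used here, and `summarize_lift` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def summarize_lift(lifts: list, z: float = 1.96):
    """Summarize per-case lift values.

    Returns (mean, (ci_low, ci_high), positive_fraction).
    ASSUMPTION: a normal-approximation 95% interval on the mean; the
    paper's paired-case CI may instead use a bootstrap or a t-interval.
    """
    n = len(lifts)
    if n < 2:
        raise ValueError("need at least two paired cases")
    m = mean(lifts)
    half_width = z * stdev(lifts) / math.sqrt(n)
    positive_fraction = sum(1 for x in lifts if x > 0) / n
    return m, (m - half_width, m + half_width), positive_fraction


# Illustrative lift values only; not data from the paper.
example_lifts = [0.5, 0.0, -0.25, 0.75]
avg, (ci_low, ci_high), pos_frac = summarize_lift(example_lifts)
```

Reporting the positive fraction alongside the mean helps separate a broad, modest improvement from a mean driven by a small number of large gains.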
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 58