SkillOpt: Trajectory-Derived, Verifier-Grounded Compilation of LLM-Agent Skills

Published: 15 May 2026, Last Modified: 23 May 2026
Venue: AgentSkills 2026 Poster
License: CC BY 4.0
Keywords: agent skills, SKILL.md, skill optimization, verifier-grounded compilation, trajectory-derived optimization, LLM agents, SkillsBench, token efficiency, agent latency, knowledge base, amortized agent cost
TL;DR: Verifier-grounded compiler from no-skill agent trajectories to deployable SKILL.md+scripts. 33/87 SkillsBench wins, median -63% tokens, -62% latency, -40% tool calls at non-regressing reward.
Abstract: LLM-agent "skills" (SKILL.md files plus optional scripts) inject procedural knowledge at inference time, but *how* that knowledge is packaged determines whether the agent treats it as a tool or as homework. We build SkillOpt, a **verifier-grounded same-task skill compiler** that takes a task's instruction, verifier source, and a no-skill trajectory, consults a Markdown knowledge base (KB) of learned patterns, emits a script-maximized skill, validates it against the verifier, and iterates with verifier feedback when reward regresses. The optimization is per-task and offline; the resulting artifact is amortized over repeated agent runs of the same task and environment, and is not claimed to generalize to held-out fixtures. We run SkillOpt across $87$ SkillsBench tasks. **It produces $33$ positive-reward wins (skills that earn nonzero reward, do not regress reward relative to the no-skill run, and reduce cost) with median reductions of $40\%$ in tool calls, $63\%$ in tokens, and $62\%$ in wall-clock latency; $13$ of these are $0.0 \to \geq 0.5$ verifier-guided rescues where the no-skill agent failed entirely.** An additional $16$ "efficient failure" tasks reduce cost while reward remains $0$, bringing total efficiency-improving runs to $49$. The knowledge base grows from $5$ seeded patterns to $19$, including domain-specific gotchas the optimizer surfaced (e.g. "Project to metric CRS for geospatial distances"). A small input-ablation study on $9$ stratified tasks separates the contributions of the trajectory and the KB: the full configuration beats KB-off on $9/9$ tasks (lexicographic, reward primary) and beats trajectory-and-KB-off (naive) on $7/9$, while KB-off versus naive splits roughly evenly. On this sample, each input does measurable work, though neither is strictly necessary on every task. We additionally publish skill_optimizer.yaml, a provenance schema that records categorical edits and before/after measurements so future optimizers can publish comparable numbers.
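The outcome categories and the "lexicographic, reward primary" comparison used in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Run` record, the `cost_reduced` rule (no cost metric worsens and at least one improves), the use of tokens as the secondary sort key, and the label strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one agent run (field names are assumptions)."""
    reward: float
    tokens: int
    latency_s: float
    tool_calls: int


def cost_reduced(base: Run, opt: Run) -> bool:
    # Assumption: "reducing cost" means no cost metric got worse
    # and at least one got strictly better.
    pairs = [
        (base.tokens, opt.tokens),
        (base.latency_s, opt.latency_s),
        (base.tool_calls, opt.tool_calls),
    ]
    return all(o <= b for b, o in pairs) and any(o < b for b, o in pairs)


def classify(base: Run, opt: Run) -> str:
    """Label a with-skill run against its no-skill baseline."""
    if opt.reward < base.reward:
        return "regression"
    if not cost_reduced(base, opt):
        return "no_gain"
    if opt.reward == 0:
        return "efficient_failure"  # cheaper, but reward stays at 0
    if base.reward == 0 and opt.reward >= 0.5:
        return "rescue"  # counted as a subset of positive-reward wins
    return "win"


def beats(a: Run, b: Run) -> bool:
    # Lexicographic comparison with reward as the primary key.
    # Assumption: fewer tokens breaks ties; the abstract does not
    # state which metric is secondary.
    return (a.reward, -a.tokens) > (b.reward, -b.tokens)
```

For example, under these assumptions `classify(Run(0.0, 1000, 60.0, 20), Run(1.0, 400, 20.0, 10))` is labeled a rescue, while the same cost reduction with reward staying at `0.0` is labeled an efficient failure.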
Presentation Mode: Yes, at least one author will attend and present in person.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 81