Keywords: LLM; agent skills; skill retrieval; Model Context Protocol
TL;DR: A two-stage retriever (BGE bi-encoder + cross-encoder reranker) exposed over MCP recovers near-oracle skill-selection quality on SkillsBench, landing within 1.7 pp of hand-curated bundles without any per-task curation.
Abstract: Anthropic Agent Skills package reusable procedural know-how into SKILL.md files, but extracting their value at scale requires a curator who reads the pool and picks a small topical bundle for each task. Existing benchmarks score skills as part of a fixed (model, harness) bundle, which leaves the retrieval sub-problem entangled with model and harness choices. We isolate that sub-problem in an open-source, specification-faithful stack (the OpenHands SDK, which implements the AgentSkills specification end-to-end, driving gpt-oss-120b) and ask how much of the hand-curated gain a deterministic plug-and-play retriever can recover. SkillSeek is a CPU-only two-stage retriever (a BGE-base bi-encoder followed by a bge-reranker-v2-m3 cross-encoder, narrowing the top-20 candidates to the top-5) exposed as a Model Context Protocol (MCP) server. On the 89-task SkillsBench, SkillSeek reaches a 0.467 task-mean pass rate: within 1.7 pp of the per-task hand-curated oracle bundle (0.484), +8.3 pp over no-skills (0.384), and +8.0 pp over a load-everything baseline (0.387); on easy tasks SkillSeek outperforms the oracle. The cost is bounded at 1.3 s of CPU-only retrieval latency per call and +15% tokens over no-skills, with no API spend and no end-to-end slowdown. Together, these results suggest that hand-curated skill selection in deployed agent systems can be largely replaced by off-the-shelf retrieval at modest cost. GitHub: https://anonymous.4open.science/r/SkillSeek/
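The retrieve-then-rerank pattern the abstract describes can be illustrated with a minimal sketch. This is not the SkillSeek implementation: the two scoring functions are toy lexical stand-ins (assumptions chosen so the example runs without model downloads) for the bi-encoder and cross-encoder, and the names `two_stage_retrieve`, `cheap_score`, `careful_score`, and the example skill pool are hypothetical. Only the top-20 to top-5 defaults come from the abstract.

```python
from typing import Callable, List, Tuple

Skill = Tuple[str, str]  # (skill name, description text); hypothetical layout


def tokens(text: str) -> set:
    """Lowercased whitespace tokens, used by both toy scorers."""
    return set(text.lower().split())


def cheap_score(query: str, doc: str) -> float:
    """Toy stand-in for the stage-1 bi-encoder: Jaccard overlap of word sets."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if (q | d) else 0.0


def careful_score(query: str, doc: str) -> float:
    """Toy stand-in for the stage-2 cross-encoder: share of query words covered."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q) if q else 0.0


def two_stage_retrieve(
    query: str,
    skills: List[Skill],
    first_k: int = 20,  # stage-1 shortlist size (top-20 in the abstract)
    final_k: int = 5,   # stage-2 output size (top-5 in the abstract)
    stage1: Callable[[str, str], float] = cheap_score,
    stage2: Callable[[str, str], float] = careful_score,
) -> List[str]:
    # Stage 1: score the whole pool with the cheap scorer, keep a shortlist.
    shortlist = sorted(skills, key=lambda s: stage1(query, s[1]), reverse=True)[:first_k]
    # Stage 2: rescore only the shortlist with the more careful scorer.
    reranked = sorted(shortlist, key=lambda s: stage2(query, s[1]), reverse=True)
    return [name for name, _ in reranked[:final_k]]


# Hypothetical skill pool for illustration only.
pool = [
    ("pdf-forms", "fill and extract fields from pdf forms"),
    ("xlsx-report", "build excel spreadsheet reports with charts"),
    ("git-bisect", "find the commit that introduced a bug with git bisect"),
    ("pdf-merge", "merge and split pdf documents"),
]
top = two_stage_retrieve("extract fields from a pdf", pool, first_k=3, final_k=2)
print(top)
```

The design point is that the expensive scorer only ever sees `first_k` candidates, so its cost stays fixed as the skill pool grows; swapping in real embedding and reranking models would change the two scorer arguments but not the control flow.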
Presentation Mode: Yes, at least one author will attend and present in person.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 71