LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

Published: 15 May 2026, Last Modified: 23 May 2026, Venue: AgentSkills 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: LH-Bench, Agent Skills, Long-Horizon Agents, Benchmarking & Evaluation, Subjective Task Evaluation, Rubric-Based Evaluation, Enterprise Workflows
Abstract: Binary success metrics work when a task has a single correct answer. They fail for long-horizon enterprise work, where agents must coordinate tools over dozens of steps, produce multiple intermediate artifacts, and satisfy subjective process constraints such as design-system discipline, source grounding, and safe iterative editing. In this setting, evaluation requires procedural knowledge, not just an outcome checker. We present LH-Bench, a benchmark and evaluation design in which expert-authored SKILL.md artifacts serve as the bridge between execution and evaluation. Skills encode workflow expectations as observable rubric boundaries, while curated artifact contracts and human preference judgments provide independent validation. We instantiate LH-Bench in two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (183 chapters across 41 courses), and evaluate three autonomous agent harness families end-to-end. Expert-authored skills make LLM judges materially more reliable than LLM-authored rubrics alone (κ = 0.60 vs. 0.46 on the same runs), and independent human preferences recover the same primary ranking boundary (p < 0.05). Skill-level decomposition exposes agent trade-offs that aggregate artifact scores hide, and structured verifier feedback enables recovery from 70.3% of observed execution errors. We release the benchmark artifacts, rubrics, and a source-grounded human reasoning dataset spanning subject matter expert annotations, chapter plans, and pairwise preferences.
Presentation Mode: Yes, at least one author will attend and present in person.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 17