AgentHard: Hardening LLM-Agent Evaluation with a Taxonomy of Artifacts and Automated Cleaning

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Agent benchmark, Benchmark filtering
Abstract: Reliable evaluation of LLM-based agents is often confounded by benchmark artifacts that conflate model errors with benchmark flaws, misrepresenting the agents' true capabilities. To address this, we present a component-wise taxonomy of common benchmark pitfalls spanning the user, environment, evaluation, and ground-truth elements of agent tasks. This analysis exposes pervasive issues such as incorrect ground-truth action sequences, ambiguous tool APIs, user-simulation faults, and brittle evaluation metrics. Guided by these insights, we develop AgentBenchCleaner, an automated three-stage pipeline: the first two stages filter out flawed tasks (rule-based detectors catch deterministic errors, and an LLM-as-a-judge identifies nuanced issues), while a third, difficulty-based curation stage enhances evaluation rigor. Applying the issue-filtering stages yields an issue-cleaned benchmark that removes these pervasive artifacts and supports more trustworthy evaluation. The difficulty-based curation stage produces a harder derivative, AgentHard-Bench, with standardized evaluation protocols and explicit quality criteria. Across diverse LLM agents, evaluations on AgentHard-Bench deliver more stable model rankings, clearer performance separation, and improved benchmark diversity relative to the original benchmarks. We will release AgentHard-Bench, along with the taxonomy and pipeline, upon acceptance to support robust, reproducible agent evaluation.
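A minimal sketch of the two issue-filtering stages described in the abstract, assuming a simple task schema and a callable LLM judge; the class names, rule checks, and judge prompt are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A benchmark task: user goal, tool API specs, and a ground-truth action sequence."""
    goal: str
    tool_specs: dict[str, dict]
    gold_actions: list[dict]
    issues: list[str] = field(default_factory=list)


def rule_based_filter(task: Task) -> Task:
    """Stage 1 (assumed rules): catch deterministic errors, e.g. gold actions
    that reference tools missing from the task's API specification."""
    for action in task.gold_actions:
        if action["tool"] not in task.tool_specs:
            task.issues.append(f"gold action uses undeclared tool: {action['tool']}")
    if not task.gold_actions:
        task.issues.append("empty ground-truth action sequence")
    return task


def llm_judge_filter(task: Task, judge) -> Task:
    """Stage 2 (assumed judge interface: a callable mapping prompt -> text):
    flag nuanced issues such as ambiguous tool APIs or an underspecified goal."""
    verdict = judge(
        f"Goal: {task.goal}\nTools: {list(task.tool_specs)}\n"
        "Is this task unambiguous and solvable with the listed tools? "
        "Answer PASS or FAIL with a one-line reason."
    )
    if verdict.strip().upper().startswith("FAIL"):
        task.issues.append(f"LLM judge: {verdict.strip()}")
    return task


def clean_benchmark(tasks: list[Task], judge) -> list[Task]:
    """Keep only tasks that survive both filtering stages; flagged tasks are dropped."""
    checked = [llm_judge_filter(rule_based_filter(t), judge) for t in tasks]
    return [t for t in checked if not t.issues]
```

The third, difficulty-based curation stage would then rank the surviving tasks (e.g. by agent failure rates) and retain the hardest subset; it is omitted here because the abstract does not specify its selection criteria.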
Primary Area: datasets and benchmarks
Submission Number: 8036