HSCodeComp: A Realistic and Expert-level Agent Benchmark for Hierarchical Rule Application

ACL ARR 2026 January Submission 5277 Authors

05 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0
Keywords: HSCode Benchmark, Deep Search Agent, Precise Multi-Turn Deep Reasoning, Hierarchical Rule Application
Abstract: Current agent benchmarks focus on open-domain web navigation or structured data utilization. However, they neglect a critical capability required for numerous domain-specific applications (e.g., legal, medical, and e-commerce): hierarchical rule application, where agents must strictly adhere to expert-written rules with implicit logic and vague boundaries. To bridge this gap, we introduce HSCodeComp, a realistic benchmark requiring agents to assign 10-digit Harmonized System Codes (HSCodes) to commercial products based on official tariff classification rules and noisy product descriptions. Sourced from real-world large-scale e-commerce platforms, HSCodeComp comprises 632 product entries spanning diverse categories, with ground-truth HSCodes annotated by domain experts. Evaluations across 23 state-of-the-art LLMs and agents reveal a large performance gap: the best agent achieves only 46.8% 10-digit accuracy, far below the 95.0% achieved by human experts. Detailed analysis further highlights the challenges of hierarchical rule application: standard test-time scaling strategies fail to yield improvements, and excessive reasoning steps degrade accuracy. Code and the benchmark will be publicly released.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Resources and Evaluation, Language Modeling
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5277
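
The abstract reports "10-digit accuracy" for HSCode prediction. As a minimal sketch of how such a metric could be computed, the function below scores exact matches on the first N digits of each code. It assumes that accuracy means exact prefix match against the expert label, which the abstract does not spell out; the function name `prefix_accuracy` and the sample codes are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def normalize(code: str) -> str:
    """Keep only the digits, so '8517.13.00.00' and '8517130000' compare equal."""
    return "".join(ch for ch in code if ch.isdigit())


def prefix_accuracy(predicted: Sequence[str], gold: Sequence[str], digits: int = 10) -> float:
    """Fraction of items whose first `digits` digits match the gold code.

    Assumption: a prediction shorter than `digits` counts as wrong.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = 0
    for p, g in zip(predicted, gold):
        p_norm, g_norm = normalize(p), normalize(g)
        if len(p_norm) >= digits and len(g_norm) >= digits and p_norm[:digits] == g_norm[:digits]:
            hits += 1
    return hits / len(gold)


if __name__ == "__main__":
    # Made-up codes for illustration only.
    pred = ["8517.13.00.00", "8517140000", "6109100012"]
    gold = ["8517130000", "8517130000", "6109100040"]
    for n in (2, 4, 6, 10):
        print(f"{n}-digit accuracy: {prefix_accuracy(pred, gold, n):.3f}")
```

Scoring at several prefix lengths (2-digit chapter, 4-digit heading, 6-digit subheading, then the full code) shows at which level of the hierarchy predictions start to diverge from the labels.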