Keywords: HSCode Benchmark, Deep Search Agent, Precise Multi-Turn Deep Reasoning, Hierarchical Rule Application
Abstract: Current agent benchmarks focus on open-domain web navigation or structured data utilization.
However, they neglect a critical capability required for numerous domain-specific applications (e.g., legal, medical and e-commerce): hierarchical rule application, where agents must strictly adhere to expert-written rules with implicit logic and vague boundaries.
To bridge this gap, we introduce \textsc{HSCodeComp}, a realistic benchmark requiring agents to assign 10-digit Harmonized System Codes (HSCode) to commercial products based on official tariff classification rules and noisy product descriptions.
Sourced from real-world large-scale e-commerce platforms, \textsc{HSCodeComp} comprises 632 product entries spanning diverse categories, with ground-truth HSCodes rigorously annotated by domain experts.
Evaluations across 23 state-of-the-art LLMs and agents reveal a large performance gap: the best agent achieves only 46.8\% 10-digit accuracy, significantly lagging behind human experts at 95.0\%.
Crucially, detailed analysis demonstrates the challenges of hierarchical rule application: standard test-time scaling strategies fail to yield improvements and excessive reasoning steps degrade accuracy.
The code and the benchmark will be publicly released.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Resources and Evaluation, Language Modeling
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 5277