ATLAS: Benchmarking and Adapting Large Language Models for Global Trade via Harmonized Tariff Code Classification
Keywords: large language models, hierarchical reasoning, benchmark, domain adaptation, fine-tuning, trade compliance, tariff classification, HTS code, LLaMA, structured prediction
TL;DR: ATLAS is a benchmark and fine-tuned LLM for Harmonized Tariff Code classification, achieving 40% 10-digit accuracy and 57.5% 6-digit accuracy, outperforming GPT-5 while being 5× cheaper.
Abstract: Accurate classification under the Harmonized Tariff Schedule (HTS) is a critical yet underexplored problem in global trade compliance, where errors can delay shipments and disrupt supply chains. We present ATLAS, the first benchmark and fine-tuned large language model for HTS code prediction, constructed from the U.S. Customs Rulings Online Search System (CROSS). The benchmark includes 18,731 legally grounded rulings spanning 2,992 unique codes, reformatted into reasoning-oriented prompts. Our fine-tuned ATLAS model (LLaMA-3.3-70B) achieves 40% accuracy at the full 10-digit level and 57.5% at the 6-digit level, improvements of +15 and +27.5 points over strong baselines, while being approximately 5× cheaper to deploy. These results establish HTS classification as a rigorous benchmark for hierarchical reasoning, cost-efficient adaptation, and alignment in domain-specialized large language models. The dataset and model are publicly released to encourage further research on structured reasoning for real-world compliance tasks.
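The 10-digit and 6-digit accuracy figures above can be read as prefix-match scores over HTS codes. The following is a minimal sketch of such a metric, assuming accuracy means an exact match on the leading digits after separators are stripped; the abstract does not spell out the evaluation protocol, so the function names (`normalize_hts`, `prefix_accuracy`) and the example codes are hypothetical and purely illustrative.

```python
import re


def normalize_hts(code: str) -> str:
    """Keep digits only, so '8471.30.0100' becomes '8471300100'."""
    return re.sub(r"\D", "", code)


def prefix_accuracy(predictions, references, digits):
    """Fraction of predictions whose first `digits` digits match the reference.

    Assumption: a prediction shorter than `digits` counts as a miss.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        p, r = normalize_hts(pred), normalize_hts(ref)
        if len(p) >= digits and len(r) >= digits and p[:digits] == r[:digits]:
            hits += 1
    return hits / len(references)


# Illustrative codes only; they are not drawn from the ATLAS benchmark.
preds = ["8471.30.0100", "6109.10.0012", "8517.12.0050", "9403.60.8081"]
refs = ["8471.30.0100", "6109.10.0040", "8517.62.0090", "9403.60.8081"]

acc_10 = prefix_accuracy(preds, refs, digits=10)  # full 10-digit match
acc_6 = prefix_accuracy(preds, refs, digits=6)    # 6-digit subheading match
print(f"10-digit: {acc_10:.2f}, 6-digit: {acc_6:.2f}")
```

Under this definition a 10-digit match implies a 6-digit match, so the 6-digit score can never fall below the 10-digit one, which is consistent with the reported 57.5% versus 40%.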
Submission Number: 25