Mitigating Hallucinations in LLMs for International Trade: Introducing the TradeGov Evaluation Dataset and TradeGuard Hallucination Mitigation Framework for Trade Q&A

Published: 13 Dec 2025, Last Modified: 16 Jan 2026 · AILaw26 · CC BY-NC-SA 4.0
Keywords: Trade, LLM, Regulations, Hallucination
Paper Type: Full papers
TL;DR: This paper introduces the first dataset and framework, called TradeGov and TradeGuard respectively, for evaluating LLMs on trade-related Q&A and for reducing hallucinations in that setting.
Abstract: Given the constant flux in geopolitics, staying up to date and compliant with international trade regulations is challenging. Whether LLMs can aid this task is a frontier hitherto unexplored in the LLM evaluation literature, primarily due to the lack of a dataset for benchmarking LLM capabilities on questions about international trade. To address this gap, we introduce TradeGov, a novel, human-audited dataset of 5k international-trade question-answer pairs spanning 138 countries, created using ChatGPT from the Country Commercial Guides on the International Trade Administration website. The dataset achieves 98% relevance and faithfulness and shows no systematic biases along macroeconomic or geographical dimensions, making it equally applicable for LLM assessment across countries. Evaluating ChatGPT-4o and Claude Sonnet 3.5 on this dataset, the first systematic evaluation of LLMs for answering questions about international trade, we find that ChatGPT-4o achieves 85% accuracy while Claude Sonnet 3.5 achieves 88%. Building on these insights, we develop TradeGuard, an ensemble trade-regulation hallucination mitigation framework that leverages majority-vote summarization and multi-agent debate to achieve 91% accuracy on the TradeGov dataset, outperforming vanilla versions of Claude and ChatGPT. TradeGuard's ensemble hallucination detection algorithm, which combines entailment verification, cross-questioning, and Bayesian regression, achieves an F1 score of 91%, significantly enhancing reliability in legal contexts. Notably, we demonstrate that TradeGuard reduces "I don't know" responses while maintaining accuracy, particularly for low-income countries, and shows no systematic biases along key macroeconomic dimensions.
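The abstract names majority voting over multiple model responses as one TradeGuard component. The paper's actual implementation is not shown here; as an illustrative sketch only, with a hypothetical `majority_vote` helper and agreement threshold of our own choosing, the idea might look like:

```python
from collections import Counter

def majority_vote(answers, min_agreement=0.5):
    """Return the most common answer among sampled model responses if it
    clears the agreement threshold; otherwise abstain.

    `answers` is a list of answer strings from repeated model queries or
    from an ensemble of models. The threshold is an illustrative choice,
    not a value taken from the paper.
    """
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return answer
    return "i don't know"

# Three sampled answers; two agree, so the vote passes the 0.5 threshold.
print(majority_vote(["10% tariff", "10% tariff", "5% tariff"]))
# With no agreement, the function abstains rather than guess.
print(majority_vote(["10% tariff", "5% tariff", "exempt"]))
```

Abstaining when responses disagree trades coverage for reliability; the paper's reported reduction in "I don't know" responses suggests TradeGuard recovers much of that coverage through its other components (multi-agent debate, entailment verification).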
Poster PDF: pdf
Submission Number: 23