Fortifying Hallucination Detection to Out-of-Domain Data

ICLR 2026 Conference Submission 19571 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: NLP, Hallucination, Hallucination Detection, Uncertainty Quantification, Machine Learning
TL;DR: We propose a data-driven solution to improve the robustness of supervised hallucination detectors in the out-of-domain setting.
Abstract: Hallucinations remain one of the major barriers to the reliable deployment of Large Language Models (LLMs). Recent work has explored both supervised classification-based approaches and unsupervised metric-based approaches, with the latter remaining popular since they do not require labeled data. However, as we show across 11 datasets and 10 models, unsupervised methods lag behind supervised ones on in-domain data, despite performing slightly better out of domain. This underscores the importance of supervised approaches, but also highlights their weakness in generalizing to unseen domains. To narrow this generalization gap, we introduce a simple approach for making supervised hallucination detectors more generalizable: training on a curated, multi-domain data mix, which can be complemented by the subsequent addition of task-specific data. In experiments on hallucination detection over 691K QA samples from 11 open-source QA datasets, we show that incorporating this general training allows supervised methods to surpass unsupervised metric-based methods by an average of +7.25% on out-of-domain data, without any task-specific data. We also analyze scaling behavior and estimate how much task-specific data is required to achieve reliable performance, finding that models augmented with general data require up to 40.3% less task-specific data to reach close-to-optimal performance. Together, our findings highlight both the brittleness of existing supervised hallucination detectors and a simple path toward fortifying detection against domain shift.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19571