Keywords: Hallucination Detection, Benchmark Dataset, Finance, Large Language Models, Long Context
Abstract: While Large Language Models (LLMs) show great promise, their tendency to hallucinate poses significant risks in high-stakes domains like finance, especially when they are used for regulatory reporting and decision-making. Existing hallucination detection benchmarks fail to capture the complexities of financial tasks, which require high numerical precision, a nuanced understanding of the language of finance, and the ability to handle long-context documents. To address this, we introduce PHANTOM, a novel benchmark dataset for evaluating hallucination detection in long-context financial QA. Our approach first generates a seed dataset of high-quality "query-answer-document (chunk)" triplets, each containing either a hallucinated or a correct answer; these triplets are validated by human annotators and subsequently expanded to cover a range of context lengths and information placements. We demonstrate how PHANTOM enables fair comparison of hallucination detection models and provides insights into LLM performance, offering a valuable resource for improving hallucination detection in financial applications. Further, our benchmarking results highlight the severe challenges that out-of-the-box models face in detecting real-world hallucinations in long-context data, and they establish promising directions for alleviating these challenges by fine-tuning open-source LLMs on PHANTOM.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/seyled/Phantom_Hallucination_Detection
Code URL: https://huggingface.co/datasets/seyled/Phantom_Hallucination_Detection/tree/main/notebook
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1074
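For readers who want to explore the triplets directly, the sketch below shows one way to load the dataset from the Hugging Face URL above using the `datasets` library. The split name and the printed columns depend on the dataset card's actual schema, which is not specified here, so nothing beyond the repository ID is assumed.

    # Minimal sketch: load PHANTOM from the Hugging Face Hub and inspect a few rows.
    # Only the repository ID is taken from the submission; split names and column
    # names are whatever the dataset itself defines.
    from datasets import load_dataset

    ds = load_dataset("seyled/Phantom_Hallucination_Detection")
    print(ds)  # shows the available splits and their column schemas

    # Peek at the first few examples of one split to see the
    # query-answer-document(chunk) triplet structure and labels.
    first_split = next(iter(ds.values()))
    for row in first_split.select(range(min(3, len(first_split)))):
        print(row)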