Keywords: Hallucination Detection, Benchmark Dataset, Finance, Large Language Models, Long Context
Abstract: While Large Language Models (LLMs) show great promise, their tendency to hallucinate poses significant risks in high-stakes domains like finance, especially when they are used for regulatory reporting and decision-making. Existing hallucination detection benchmarks fail to capture the complexities of financial tasks, which require high numerical precision, a nuanced understanding of the language of finance, and the ability to handle long-context documents. To address this, we introduce PHANTOM, a novel benchmark dataset for evaluating hallucination detection in long-context financial QA. Our approach first generates a seed dataset of high-quality "query-answer-document (chunk)" triplets, each containing either a hallucinated or a correct answer; these triplets are validated by human annotators and subsequently expanded to cover a range of context lengths and information placements. We demonstrate how PHANTOM enables fair comparison of hallucination detection models and provides insights into LLM performance, offering a valuable resource for improving hallucination detection in financial applications. Further, our benchmarking results highlight the severe challenges that out-of-the-box models face in detecting real-world hallucinations in long-context data, and they establish promising directions for alleviating these challenges by fine-tuning open-source LLMs on PHANTOM.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/seyled/Phantom_Hallucination_Detection
Code URL: https://huggingface.co/datasets/seyled/Phantom_Hallucination_Detection/tree/main/notebook
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1074
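For readers who want to explore the triplets directly, the sketch below shows one way to load the dataset from the Hugging Face URL above using the `datasets` library. The split name and the printed columns depend on the dataset card's actual schema, which is not specified here, so nothing beyond the repository ID is assumed.

    # Minimal sketch: load PHANTOM from the Hugging Face Hub and inspect a few rows.
    # Only the repository ID is taken from the submission; split names and column
    # names are whatever the dataset itself defines.
    from datasets import load_dataset

    ds = load_dataset("seyled/Phantom_Hallucination_Detection")
    print(ds)  # shows the available splits and their column schemas

    # Peek at the first few examples of one split to see the
    # query-answer-document(chunk) triplet structure and labels.
    first_split = next(iter(ds.values()))
    for row in first_split.select(range(min(3, len(first_split)))):
        print(row)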