PROBE: PROcess-Based BEnchmark for Hallucination Detection

Published: 02 Mar 2026, Last Modified: 05 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: LLM hallucination, process-based evaluation, hallucination detection, fine-grained benchmark
Abstract: Hallucination detection remains a significant challenge for large language models (LLMs). Existing agentic applications rely on LLMs to self-assess the factuality of their outputs using single-step "LLM-as-a-judge" prompts. However, even when equipped with ground-truth information, current LLMs still fall short at detecting hallucinations, and this one-shot evaluation offers neither the transparency nor the granularity needed to diagnose where and why detection fails. To address this gap, we introduce PROBE (PROcess-Based BEnchmark for Hallucination Detection), a comprehensive benchmark that breaks hallucination detection down into four critical steps: claim decomposition, evidence finding, evidence evaluation, and hallucination localization; each step is evaluated individually. PROBE consists of 12,000 test cases across three task types: summarization, question answering, and style transfer. Critically, we demonstrate that when hallucination detection is treated as a multi-step process, all models achieve considerably better performance. Through extensive evaluation, we show that current LLMs struggle chiefly with evidence finding, and that fine-tuning on our released training data substantially improves performance on this step. PROBE represents a significant step toward more transparent, diagnosable, and robust hallucination detection systems.
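To make the four-step process concrete, below is a minimal Python sketch of a process-based detection pipeline in the spirit of the abstract. Everything in it is illustrative: `call_llm`, the prompts, and the helper names are hypothetical stand-ins, not PROBE's interface or the authors' implementation.

```python
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; plug in any provider's client."""
    raise NotImplementedError("wire up your LLM client here")


@dataclass
class ClaimVerdict:
    claim: str
    evidence: list[str]
    supported: bool


def decompose_claims(output: str) -> list[str]:
    """Step 1, claim decomposition: split the output into atomic, checkable claims."""
    response = call_llm(
        "List each atomic factual claim in the text, one per line:\n" + output
    )
    return [line.strip() for line in response.splitlines() if line.strip()]


def find_evidence(claim: str, source: str) -> list[str]:
    """Step 2, evidence finding: pull source passages that bear on the claim."""
    response = call_llm(
        f"Quote the passages from the source relevant to this claim.\n"
        f"Claim: {claim}\nSource:\n{source}"
    )
    return [line.strip() for line in response.splitlines() if line.strip()]


def evaluate_evidence(claim: str, evidence: list[str]) -> bool:
    """Step 3, evidence evaluation: judge whether the evidence supports the claim."""
    response = call_llm(
        f"Does the evidence support the claim? Answer yes or no.\n"
        f"Claim: {claim}\nEvidence:\n" + "\n".join(evidence)
    )
    return response.strip().lower().startswith("yes")


def detect_hallucinations(output: str, source: str) -> list[ClaimVerdict]:
    """Step 4, hallucination localization: report the claims left unsupported."""
    verdicts = []
    for claim in decompose_claims(output):
        evidence = find_evidence(claim, source)
        verdicts.append(ClaimVerdict(claim, evidence, evaluate_evidence(claim, evidence)))
    return [v for v in verdicts if not v.supported]
```

A pipeline of this shape makes failures attributable to a specific step: for example, empty or irrelevant passages returned by `find_evidence` would point to the evidence-finding weakness the abstract identifies as the main bottleneck.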
Submission Number: 17