PROBE: PROcess-Based BEnchmark for Hallucination Detection

Published: 02 Mar 2026, Last Modified: 05 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: LLM hallucination, process-based evaluation, hallucination detection, fine-grained benchmark
Abstract: Hallucination detection remains a significant challenge for large language models (LLMs). Existing agentic applications rely on LLMs to self-assess the factuality of their outputs using single-step "LLM-as-a-judge" prompts. However, even when equipped with ground-truth information, current LLMs still fall short at detecting hallucinations, and this one-shot evaluation offers neither the transparency nor the granularity needed to diagnose where and why detection fails. To address this gap, we introduce PROBE (PROcess-Based BEnchmark for Hallucination Detection), a comprehensive benchmark that breaks hallucination detection down into four critical steps: claim decomposition, evidence finding, evidence evaluation, and hallucination localization; each step is evaluated individually. PROBE consists of 12,000 test cases across three task types: summarization, question answering, and style transfer. Critically, we demonstrate that when hallucination detection is treated as a multi-step process, all models achieve considerably better performance. Through extensive evaluation, we show that current LLMs struggle chiefly with evidence finding, and that fine-tuning on our released training data substantially improves performance on this step. PROBE represents a significant step toward more transparent, diagnosable, and robust hallucination detection systems.
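To make the four-step process concrete, below is a minimal Python sketch of a process-based detection pipeline in the spirit of the abstract. Everything in it is illustrative: `call_llm`, the prompts, and the helper names are hypothetical stand-ins, not PROBE's interface or the authors' implementation.

```python
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; plug in any provider's client."""
    raise NotImplementedError("wire up your LLM client here")


@dataclass
class ClaimVerdict:
    claim: str
    evidence: list[str]
    supported: bool


def decompose_claims(output: str) -> list[str]:
    """Step 1, claim decomposition: split the output into atomic, checkable claims."""
    response = call_llm(
        "List each atomic factual claim in the text, one per line:\n" + output
    )
    return [line.strip() for line in response.splitlines() if line.strip()]


def find_evidence(claim: str, source: str) -> list[str]:
    """Step 2, evidence finding: pull source passages that bear on the claim."""
    response = call_llm(
        f"Quote the passages from the source relevant to this claim.\n"
        f"Claim: {claim}\nSource:\n{source}"
    )
    return [line.strip() for line in response.splitlines() if line.strip()]


def evaluate_evidence(claim: str, evidence: list[str]) -> bool:
    """Step 3, evidence evaluation: judge whether the evidence supports the claim."""
    response = call_llm(
        f"Does the evidence support the claim? Answer yes or no.\n"
        f"Claim: {claim}\nEvidence:\n" + "\n".join(evidence)
    )
    return response.strip().lower().startswith("yes")


def detect_hallucinations(output: str, source: str) -> list[ClaimVerdict]:
    """Step 4, hallucination localization: report the claims left unsupported."""
    verdicts = []
    for claim in decompose_claims(output):
        evidence = find_evidence(claim, source)
        verdicts.append(ClaimVerdict(claim, evidence, evaluate_evidence(claim, evidence)))
    return [v for v in verdicts if not v.supported]
```

A pipeline of this shape makes failures attributable to a specific step: for example, empty or irrelevant passages returned by `find_evidence` would point to the evidence-finding weakness the abstract identifies as the main bottleneck.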
Submission Number: 17