A Survey of Automatic Hallucination Evaluation on Natural Language Generation

TMLR Paper 5135 Authors

17 Jun 2025 (modified: 25 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: The proliferation of Large Language Models (LLMs) has introduced a critical challenge: evaluating hallucinations accurately enough to ensure model reliability. While Automatic Hallucination Evaluation (AHE) has become essential, the field suffers from methodological fragmentation, which hinders both theoretical understanding and practical advancement. This survey addresses this gap through a comprehensive analysis of 105 evaluation methods, revealing that 77.1% specifically target LLMs, a paradigm shift that demands new evaluation frameworks. We formulate a structured framework to organize the field, built on a comprehensive survey of foundational datasets and benchmarks and a taxonomy of evaluation methodologies, which together systematically document the evolution from pre-LLM to post-LLM approaches. Beyond this taxonomic organization, we identify fundamental limitations of current approaches and their implications for real-world deployment. To guide future research, we delineate key challenges and propose strategic directions, including enhanced interpretability mechanisms and the integration of application-specific evaluation criteria, ultimately providing a roadmap for developing more robust and practical hallucination evaluation systems.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Shay_B_Cohen1
Submission Number: 5135