Abstract: We introduce CHEF, a novel Comparative Hallucination Evaluation Framework that leverages the HaluEval2.0 LLM-in-the-loop hallucination detection pipeline to directly measure the relative effectiveness of hallucination mitigation techniques, specifically retrieval-augmented generation (RAG) and fine-tuning. While HaluEval2.0 provides absolute hallucination scores using a single evaluator LLM, CHEF demonstrates that by evaluating an identical model architecture across three distinct configurations, we can attribute the resulting differences in hallucination rates to each specific technique. Our experiments with CHEF across science, biomedical, and other domains reveal variable effectiveness of both RAG and fine-tuning, with significant domain-dependent performance differences, offering actionable insights into mitigation strategies.
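The following is a minimal sketch of the comparative setup the abstract describes: the same base model is queried under three configurations, a single evaluator LLM judges each answer, and differences in the resulting hallucination rates are attributed to RAG or fine-tuning. All function names (`generate_answer`, `judge_hallucination`) are hypothetical placeholders, not the actual CHEF or HaluEval2.0 API.

```python
from statistics import mean

def generate_answer(question: str, config: str) -> str:
    """Hypothetical: query the same base model in one of three
    configurations ('baseline', 'rag', 'fine_tuned')."""
    raise NotImplementedError

def judge_hallucination(question: str, answer: str) -> bool:
    """Hypothetical: ask a single evaluator LLM (HaluEval2.0-style
    LLM-in-the-loop) whether the answer contains a hallucination."""
    raise NotImplementedError

def hallucination_rate(questions: list[str], config: str) -> float:
    # Fraction of answers the evaluator flags as hallucinated.
    return mean(
        judge_hallucination(q, generate_answer(q, config)) for q in questions
    )

def compare_configs(questions: list[str]) -> dict[str, float]:
    # Because the model architecture is identical across configurations,
    # differences between these rates can be attributed to RAG or fine-tuning.
    return {
        config: hallucination_rate(questions, config)
        for config in ("baseline", "rag", "fine_tuned")
    }
```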
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: fine-tuning, LLM/AI agents, retrieval-augmented generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7468