CHEF: A Comparative Hallucination Evaluation Framework for Large Language Models

01 Sept 2025 (modified: 27 Oct 2025) · Submitted to NeurIPS Lock-LLM Workshop 2025 · CC BY 4.0
Keywords: fine-tuning, LLM/AI agents, retrieval-augmented generation
TL;DR: A comparative hallucination evaluation framework for large language models
Abstract: We introduce CHEF, a novel Comparative Hallucination Evaluation Framework that leverages the HaluEval2.0 LLM-in-the-loop hallucination detection pipeline to directly measure the relative effectiveness of hallucination mitigation techniques, specifically retrieval-augmented generation (RAG) and fine-tuning. While HaluEval2.0 provides absolute hallucination scores using a single evaluator LLM, CHEF demonstrates that by evaluating an identical model architecture across three distinct configurations, the resulting differences in hallucination rates can be attributed to each specific technique. Our CHEF experiments across science, biomedical, and other domains reveal variable effectiveness for both RAG and fine-tuning, with significant domain-dependent performance differences, offering valuable and actionable insights into mitigation strategies.
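
To make the comparative design concrete, the following minimal Python sketch (our illustration, not the authors' code; all names such as `compare_configs`, `generators`, and `judge` are hypothetical stand-ins for the HaluEval2.0 pipeline) scores one model under several configurations with a shared evaluator LLM and attributes the rate deltas to each technique.

```python
# Illustrative sketch of a comparative hallucination evaluation (all names
# here are hypothetical, not the paper's actual API). One model is run
# under multiple configurations and judged by the same evaluator LLM.
from typing import Callable, Dict, List


def hallucination_rate(flags: List[bool]) -> float:
    """Fraction of responses the evaluator LLM flagged as hallucinated."""
    return sum(flags) / len(flags) if flags else 0.0


def compare_configs(
    questions: List[str],
    generators: Dict[str, Callable[[str], str]],  # e.g. "base", "rag", "fine-tuned"
    judge: Callable[[str, str], bool],            # evaluator verdict per (question, answer)
) -> Dict[str, float]:
    """Score identical questions under each configuration of one model."""
    rates: Dict[str, float] = {}
    for name, generate in generators.items():
        answers = [generate(q) for q in questions]
        flags = [judge(q, a) for q, a in zip(questions, answers)]
        rates[name] = hallucination_rate(flags)
    return rates


# Because the architecture, questions, and judge are held fixed, the rate
# deltas against the shared baseline isolate the effect of each technique:
#   effect_rag = rates["base"] - rates["rag"]
#   effect_ft  = rates["base"] - rates["fine-tuned"]
```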
Submission Number: 13