Abstract: We introduce CHEF, a novel Comparative Hallucination Evaluation Framework that leverages the HaluEval2.0 LLM-in-the-loop hallucination detection pipeline to directly measure the relative effectiveness of hallucination mitigation techniques, specifically retrieval-augmented generation (RAG) and fine-tuning. While HaluEval2.0 provides absolute hallucination scores using a single evaluator LLM, CHEF demonstrates that by evaluating an identical model architecture across three distinct configurations, we can attribute the resulting differences in hallucination rates to each specific technique. Our experiments with CHEF across science, biomedical, and other domains reveal variable effectiveness of both RAG and fine-tuning, with significant domain-dependent performance differences, offering actionable insights into mitigation strategies.
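The following is a minimal sketch of the comparative setup the abstract describes: the same base model is queried under three configurations, a single evaluator LLM judges each answer, and differences in the resulting hallucination rates are attributed to RAG or fine-tuning. All function names (`generate_answer`, `judge_hallucination`) are hypothetical placeholders, not the actual CHEF or HaluEval2.0 API.

```python
from statistics import mean

def generate_answer(question: str, config: str) -> str:
    """Hypothetical: query the same base model in one of three
    configurations ('baseline', 'rag', 'fine_tuned')."""
    raise NotImplementedError

def judge_hallucination(question: str, answer: str) -> bool:
    """Hypothetical: ask a single evaluator LLM (HaluEval2.0-style
    LLM-in-the-loop) whether the answer contains a hallucination."""
    raise NotImplementedError

def hallucination_rate(questions: list[str], config: str) -> float:
    # Fraction of answers the evaluator flags as hallucinated.
    return mean(
        judge_hallucination(q, generate_answer(q, config)) for q in questions
    )

def compare_configs(questions: list[str]) -> dict[str, float]:
    # Because the model architecture is identical across configurations,
    # differences between these rates can be attributed to RAG or fine-tuning.
    return {
        config: hallucination_rate(questions, config)
        for config in ("baseline", "rag", "fine_tuned")
    }
```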
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: fine-tuning, LLM/AI agents, retrieval-augmented generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7468