Track: Technical
Keywords: Hallucination Mitigation, Language Model Hallucination, Hallucination Detection Framework, LLM Evaluation Metrics, Ethical AI Development, Factual Consistency in LMs, Robustness in Language Models, Model Accountability, Transparency in Language Models, Auditing LMs, AI Fairness and Equity, Answer Faithfulness
TL;DR: THaMES is a framework for evaluating and mitigating hallucinations in LLMs. It automates testset creation, provides multifaceted benchmarking, and applies mitigation strategies such as ICL, RAG, and PEFT.
Abstract: Hallucination, the generation of factually incorrect or confabulated content, is a growing problem for Large Language Models (LLMs). Although hallucination detection and mitigation methods exist, they are largely isolated and often inadequate for domain-specific use cases, and no standardized pipeline combines domain-pertinent dataset generation, hallucination detection benchmarking, and mitigation strategies into a single tool. This paper proposes THaMES (Tool for Hallucination Mitigations and EvaluationS), an end-to-end framework and library that evaluates and mitigates hallucinations in LLMs through automated testset generation, multifaceted benchmarking, and flexible mitigation strategies. THaMES automates testset generation from any corpus of information while achieving high data quality and diversity and maintaining cost-effectiveness, using batch processing, weighted sampling, counterfactual validation, and complex question types. It evaluates a model's ability both to identify hallucinations and to produce less hallucinated output across multiple evaluation tasks, including text generation and binary classification. The framework also applies the mitigation strategy best suited to a given model and knowledge base, drawing on a suite of strategies that includes In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluating a range of state-of-the-art LLMs on a knowledge base of academic papers, political news articles, and Wikipedia articles, we find that commercial models such as GPT-4o benefit more from RAG than from ICL, whereas open-weight models such as Llama-3.1-8B-Instruct and Mistral-Nemo, while also improving with RAG, benefit more from the reasoning provided by ICL. In an experiment with the open-weight model Llama-3.1-8B-Instruct, PEFT significantly improved over the base model on aspects of both evaluation tasks.
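To make the binary-classification evaluation task concrete, the sketch below scores a hallucination-detection judge against gold labels from a generated testset and reports precision, recall, F1, and accuracy. This is a minimal, hypothetical illustration only: the names `TestCase` and `evaluate_detection` are assumptions for this sketch and are not the THaMES API.

```python
# Hypothetical sketch: scoring a binary hallucination-detection judge
# against gold labels, in the spirit of the classification task above.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TestCase:
    question: str
    answer: str            # candidate answer, possibly hallucinated
    is_hallucinated: bool   # gold label from the generated testset


def evaluate_detection(judge: Callable[[str, str], bool],
                       cases: List[TestCase]) -> Dict[str, float]:
    """Compare judge(question, answer) -> bool predictions against gold labels."""
    tp = fp = fn = tn = 0
    for case in cases:
        pred = judge(case.question, case.answer)
        if pred and case.is_hallucinated:
            tp += 1
        elif pred and not case.is_hallucinated:
            fp += 1
        elif not pred and case.is_hallucinated:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(cases) if cases else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy judge for demonstration: flags answers mentioning "unicorns".
    toy_judge = lambda question, answer: "unicorns" in answer.lower()
    demo_cases = [
        TestCase("What is the capital of France?", "Paris", is_hallucinated=False),
        TestCase("What is the capital of France?", "A city of unicorns", is_hallucinated=True),
    ]
    print(evaluate_detection(toy_judge, demo_cases))
```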
Submission Number: 42