Student Lead Author Indication: Yes
Keywords: hallucination, large language models, corpus creation, benchmarking, language resources, NLP datasets, evaluation methodologies
TL;DR: This paper studies LLM hallucinations in responses to real-world healthcare queries spanning diverse medical topics, and proposes a dataset and an LLM-based hallucination detection framework for healthcare.
Abstract: Large language models (LLMs) are prone to hallucinations, generating plausible yet factually incorrect or fabricated information. As LLM-powered chatbots become popular for health-related queries, non-experts risk receiving hallucinated health advice. This work conducts a pioneering study of hallucinations in LLM-generated responses to real-world healthcare queries from patients. We introduce MedHalu, a novel medical hallucination benchmark covering diverse health-related topics and hallucinated LLM responses, with detailed annotations of hallucination types and text spans. We further propose MedHaluDetect, a comprehensive framework for evaluating LLMs' ability to detect hallucinations. Using this framework, we study how vulnerable three groups -- medical experts, LLMs, and laypeople -- are to medical hallucinations. Notably, LLMs significantly underperform human experts and, in some cases, even laypeople. To improve detection, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, yielding significant gains for all LLMs, including a 6.3% macro-F1 improvement for GPT-4.
Submission Number: 3