ConvFaithEval: Evaluating Faithfulness of Large Language Models in Real-World Customer Service Conversations

ACL ARR 2024 December Submission1667 Authors

16 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: Large Language Models (LLMs) excel in diverse tasks but are prone to hallucinations. Most existing benchmarks primarily focus on evaluating factual hallucinations, while the assessment of faithfulness hallucinations remains underexplored, especially in practical conversations that involve casual language and topic shifts. To bridge this gap, we introduce ConvFaithEval, the first faithfulness hallucination evaluation benchmark built on real-world customer service conversations. ConvFaithEval features 3,369 anonymized online conversations, over 88% of which involve multiple topics. We design two tasks, Conversation Summarization and Quiz Examination, to comprehensively assess faithfulness hallucinations across 23 LLMs. Extensive experiments reveal that faithfulness hallucinations persist across all LLMs, with closed-source LLMs consistently outperforming their open-source counterparts. To mitigate these hallucinations, we further explore four strategies and offer insights for the future development of more advanced methods.
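To make the two evaluation tasks concrete, here is a minimal sketch of a driver that poses both tasks to a model under test. This is not the authors' released code: the dataset file name, its field names, and the `query_llm` helper are hypothetical assumptions for illustration only.

```python
# Minimal sketch of ConvFaithEval's two tasks (assumed data layout and
# model client; not the authors' implementation).
import json


def query_llm(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError("plug in your model client here")


def summarization_task(conversation: str) -> str:
    # Task 1: Conversation Summarization — summarize the dialogue using
    # only information actually present in it, so unsupported content
    # can be flagged as a faithfulness hallucination.
    prompt = (
        "Summarize the following customer service conversation, "
        "using only facts stated in it:\n\n" + conversation
    )
    return query_llm(prompt)


def quiz_task(conversation: str, question: str, options: list[str]) -> str:
    # Task 2: Quiz Examination — answer a question whose ground truth is
    # contained in the conversation itself.
    choices = "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
    )
    prompt = (
        "Based solely on the conversation below, answer the question.\n\n"
        f"Conversation:\n{conversation}\n\n"
        f"Question: {question}\n{choices}"
    )
    return query_llm(prompt)


if __name__ == "__main__":
    # Hypothetical JSONL layout: one anonymized conversation per line.
    with open("convfaitheval.jsonl", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            print(summarization_task(example["conversation"]))
```

Model outputs from both tasks would then be scored against the source conversation, for example by checking summaries and quiz answers for content unsupported by the dialogue.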
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Hallucination, Benchmark, Large Language Models
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 1667
