ConvFaithEval: Evaluating Faithfulness of Large Language Models with Real-World Customer Service Conversations

ACL ARR 2025 February Submission 5598 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) excel at diverse tasks but are prone to hallucinations. Most existing benchmarks focus primarily on evaluating factual hallucinations, while the assessment of faithfulness hallucinations remains underexplored, especially in practical conversations involving casual language and topic shifts. To bridge this gap, we introduce ConvFaithEval, the first faithfulness hallucination evaluation benchmark built on real-world customer service conversations. Two tasks, Conversation Summarization and Quiz Examination, are designed to comprehensively assess faithfulness hallucinations in LLMs. Extensive experiments on 22 LLMs reveal that faithfulness hallucinations persist across all evaluated models.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Hallucination, Benchmark, Large Language Models
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 5598