Comparing Hallucination Detection Methods for Multilingual Generation

ACL ARR 2024 June Submission 2968 Authors

15 Jun 2024 (modified: 28 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: While many hallucination detection techniques have been evaluated on English text, their effectiveness in multilingual contexts remains unknown. This paper assesses how well various factual hallucination detection metrics (lexical metrics, such as ROUGE and Named Entity Overlap, and Natural Language Inference (NLI)-based metrics) identify hallucinations in generated biographical summaries across languages. We compare how well the automatic metrics correlate with each other and whether they agree with human judgments of factuality. Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models. However, NLI metrics are still limited: they do not detect single-fact hallucinations well and fail on lower-resource languages. Our findings therefore highlight the gaps in existing hallucination detection methods for non-English languages and motivate future research on more robust multilingual detection of LLM hallucinations.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Large Language Models, Multilingual NLP, Hallucination
Languages Studied: English, Spanish, Russian, Indonesian, Vietnamese, Persian, Ukrainian, Swedish, Thai, Japanese, German, Romanian, Hungarian, Bulgarian, French, Finnish, Korean, Italian, Chinese
Submission Number: 2968