Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics

ACL ARR 2025 July Submission1379 Authors

29 Jul 2025 (modified: 18 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Language models have shown strong capabilities across a wide range of software engineering tasks, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and in code generation, their occurrence in tasks involving code changes, whose format is structurally complex and context-dependent, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical code-change-to-natural-language tasks: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics contribute effectively to hallucination detection, showing promise for inference-time detection. All code and data will be released upon acceptance.
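The abstract's claim that combining multiple metrics improves detection can be illustrated with a minimal sketch: per-example metric scores (here hypothetical placeholders such as a similarity score, a mean token-confidence score, and a feature-attribution score) are fed to a simple logistic-regression classifier that flags hallucinations. This is an illustrative assumption of how such a combination might be set up, not the paper's released code; the metric names, data, and labels below are synthetic.

```python
# Illustrative sketch: combining several detection metrics into one
# hallucination classifier via logistic regression. All scores and
# labels below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Each row: [similarity_to_diff, mean_token_confidence, attribution_coverage]
# Each label: 1 if the generated message/review is marked as hallucinated.
X = rng.random((500, 3))          # placeholder metric scores in [0, 1]
y = (X[:, 1] < 0.4).astype(int)   # placeholder labels for the sketch

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)   # learn to combine the metrics
scores = clf.predict_proba(X_te)[:, 1]       # per-example hallucination probability
print("AUROC of the combined detector:", roc_auc_score(y_te, scores))
```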
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM hallucinations, code change to natural language, automated code review, automated commit message generation
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1379