Hallucinations in Code to Natural Language Generation: Prevalence and Evaluation of Detection Metrics

ACL ARR 2025 May Submission 7292 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Language models have demonstrated remarkable capabilities across a wide range of software engineering tasks, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language generation and in code generation, their occurrence in tasks involving both code and natural language is underexplored. This paper presents the first comprehensive analysis of hallucinations in two critical code to natural language generation tasks: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to detect them automatically. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations, predominantly manifesting as input inconsistency, logic inconsistency, or intention violation. While commonly used metrics are weak detectors on their own (56.6% and 61.7% ROC-AUC, respectively), combining multiple metrics substantially improves performance (69.1% and 75.3%, respectively). Notably, model confidence and feature attribution metrics contribute effectively to hallucination detection, showing promise for inference-time detection. Our work calls for future research on more robust detection methods and hallucination-resistant models for code to natural language generation. All code and data will be released upon acceptance.
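
The abstract reports that single metrics are weak hallucination detectors but that combining several metrics raises ROC-AUC. The paper's own code is not released here, so the sketch below is only an illustration of that general setup: per-example scores from hypothetical metric families (a similarity score, a model-confidence score, and a feature-attribution score, with synthetic labels standing in for human hallucination annotations) are combined with a simple logistic-regression detector and compared against a single-metric baseline using ROC-AUC.

```python
# Illustrative sketch only: metric names and the synthetic data are hypothetical
# stand-ins for the metric families and annotations described in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder per-example scores: [similarity, model confidence, attribution].
n = 500
X = rng.normal(size=(n, 3))

# Synthetic binary labels (1 = hallucinated) weakly correlated with the scores,
# standing in for human annotations of generated commit messages / reviews.
logits = 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2]
y = (logits + rng.normal(scale=1.0, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Single-metric detector: rank test examples by one metric alone.
single_auc = roc_auc_score(y_test, X_test[:, 1])

# Combined detector: logistic regression over all metric scores.
clf = LogisticRegression().fit(X_train, y_train)
combined_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(f"single-metric ROC-AUC:   {single_auc:.3f}")
print(f"combined-metric ROC-AUC: {combined_auc:.3f}")
```

On real annotated data, the same pattern of single-metric versus combined-metric ROC-AUC is what the reported 56.6%/61.7% versus 69.1%/75.3% figures summarize; the combiner need not be logistic regression.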
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code to natural language generation, hallucination, automated code review, commit message generation, AI4SE
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 7292