Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

Published: 28 Mar 2026, Last Modified: 28 Mar 2026 · AIware 2026 · CC BY 4.0
Keywords: Large Language Models, Code Translation, Program Correctness
TL;DR: This work presents a study of false failures during LLM-based code translation
Abstract: Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus) using GPT-4o. Our analysis identifies and categorizes common false negatives, revealing that standard evaluation pipelines frequently misclassify semantically correct translations. Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Reroute: true
Submission Number: 54