Keywords: code translation, large language models, natural language, chain-of-thought prompting, abstract syntax trees
Abstract: Code translation is the task of converting code from one programming language (e.g., Java) to another (e.g., Python) \cite{pan2024lost}. Early statistical efforts \cite{oda2015learning} have recently given way to methods based on large language models (LLMs). However, studies show that LLMs still produce buggy translations under zero-shot prompting \cite{pan2024lost}. One promising avenue for improving translation accuracy is intermediate representations, which provide structured guidance for the translation process. We investigate whether LLM-based code translation can benefit from intermediate representations (IRs), specifically in the form of natural language (NL) summaries and abstract syntax trees (ASTs). We explore two main approaches to incorporating IRs into code translation:
(1) a two-step approach (2S), where the LLM first translates the original code to IR and then translates this IR to the target language \cite{ahmad-etal-2023-summarize}; and
(2) a chain-of-thought (CoT) prompting approach, where the LLM is instructed to use IR to explain its reasoning during translation.
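The two approaches above can be sketched as prompt-construction routines. This is an illustrative sketch only: the paper's exact prompt wording is not given in the abstract, so the templates and function names below are assumptions.

```python
# Hypothetical prompt templates for the two IR-based translation approaches.
# The actual wording used in the paper is not specified in the abstract.

def two_step_prompts(source_code, source_lang, target_lang,
                     ir="natural-language summary"):
    """Two-step (2S): first elicit the IR, then translate the IR."""
    step1 = (f"Summarize the following {source_lang} code as a {ir}:\n"
             f"{source_code}")
    # In the real pipeline, the LLM's answer to step1 is spliced in here.
    step2 = (f"Translate this {ir} into {target_lang} code:\n"
             "<IR produced in step 1>")
    return step1, step2

def cot_prompt(source_code, source_lang, target_lang,
               ir="natural-language summary"):
    """Chain-of-thought (CoT): one prompt asking the model to reason via the IR."""
    return (f"Translate the following {source_lang} code to {target_lang}. "
            f"First write a {ir} of what the code does, then use it to "
            f"produce the translation.\n{source_code}")
```

The key design difference is that 2S makes two separate LLM calls with the IR as an explicit intermediate artifact, while CoT produces the IR and the translation in a single generation.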
For our experiments, we use two code translation benchmarks: sampled CodeNet \cite{puricodenet} (languages: C, C++, Go, Java, Python) and AVATAR \cite{ahmad2023avatarparallelcorpusjavapython} (languages: Java, Python). In Phase 1, we experiment with different permutations of IRs (NL, AST, or both) and compare them with simple zero-shot (0SP) and one-shot (1SP) prompting baselines, using the open-source Open GPT4 8X7B LLM \cite{theblokeopengpt4} as the backbone on sampled CodeNet. Based on the Phase 1 results, in Phase 2 we evaluate the two highest-performing prompts with Open GPT4 8X7B, StarCoder \cite{li2023starcodersourceyou}, and CodeGen. Following \cite{pan2024lost}, we use the percentage of successful translations as the performance metric, where successfully translated code must compile, pass runtime checks, and pass existing tests.
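Under the metric described above, a translation counts as successful only if it clears all three checks. A minimal sketch of this computation, assuming per-sample boolean outcomes for compilation, runtime checks, and tests:

```python
# Hypothetical helper: computes the percentage of successful translations,
# following the success criterion of Pan et al. (2024) as described here.
# A sample succeeds only if it compiles, runs cleanly, AND passes the tests.

def success_rate(results):
    """results: iterable of (compiled, ran_ok, tests_passed) booleans."""
    results = list(results)
    ok = sum(1 for compiled, ran_ok, tests_passed in results
             if compiled and ran_ok and tests_passed)
    return 100.0 * ok / len(results)
```

For example, two fully passing samples out of four yield a success rate of 50.0.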
We found that CoT with NL performed best, with improvements of 13.8% and 6.7% on the CodeNet and AVATAR datasets, respectively, over the zero-shot prompt with Open GPT4 8X7B. Our experiments across two benchmarks and multiple LLMs underscore the benefits of using IRs in code translation and suggest that these findings may generalize. Future work includes exploring additional languages and datasets.
Submission Number: 105