T2J: Leveraging Developer Bug-Fixing Behaviors to Evaluate and Improve LLM-Based PyTorch-to-JAX Translation
Keywords: Code Translation, Large Language Model, In-Context Learning
TL;DR: This work proposes T2J, an in-context learning approach that improves the ability of LLMs to translate PyTorch code to JAX
Abstract: While Large Language Models (LLMs) have shown strong performance in code-to-code translation for widely used programming languages, their application to PyTorch-to-JAX translation remains challenging. Although both frameworks are implemented in Python, they differ fundamentally in design principles, execution models, and ecosystem maturity, with JAX being relatively new and underrepresented in public code repositories. Moreover, the lack of parallel PyTorch-JAX datasets and the limitations of existing evaluation metrics hinder effective cross-framework translation. In this work, we propose \ourtool, a prompt augmentation framework for improving LLM-based PyTorch-to-JAX translation. First, we construct two PyTorch code datasets, a problem-solving code dataset collected from the \textit{TorchLeet} repository \citep{torchleet2025} and a GitHub code dataset collected from the \textit{CodeParrot} benchmark \citep{codeparrot2022}, and use the low-cost GPT-4o-mini to generate initial translations. Second, we employ two professional developers to iteratively fix the generated JAX code until it is functionally equivalent to the original PyTorch code, yielding a curated \textit{fixed-bug dataset} that captures common translation errors and their corresponding fixes. Third, we design augmented prompts that incorporate structured guidance from the fixed-bug dataset to improve the translation quality of lightweight LLMs such as GPT-4o-mini. Finally, we leverage LLM-as-a-judge evaluation and LLM-based estimation of the effort of each bug-fixing step to propose three evaluation metrics for PyTorch-to-JAX code translation: \ourtoolnospace\_CodeTrans\_Score, \ourtoolnospace\_FixCost\_Score, and \ourtoolnospace\_Comparison\_Score. Our results demonstrate that \ourtool significantly improves GPT-4o-mini performance by up to \textbf{10\%} in CodeBLEU, \textbf{50\%} in \ourtoolnospace\_FixCost\_Score, \textbf{1.33 points} in \ourtoolnospace\_CodeTrans\_Score (on a 0-4 scale), and \textbf{100\%} in \ourtoolnospace\_Comparison\_Score. Code generated by \ourtool also runs up to 2.5$\times$ faster than the baseline's output. The replication package is available at: \url{https://tinyurl.com/yradutma}.
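To make the translation challenge concrete, here is a minimal illustrative sketch (not an example from the submission) of the kind of PyTorch-to-JAX pitfall that a fixed-bug dataset like the one described above would capture, assuming only the standard torch and jax APIs: in-place tensor mutation and implicit global randomness in PyTorch must be rewritten as functional updates and explicit PRNG keys in JAX.

import torch
import jax
import jax.numpy as jnp

# PyTorch: tensors are mutable and randomness is driven by global state.
torch.manual_seed(0)
x_t = torch.randn(3)
x_t[0] = 1.0  # in-place assignment is idiomatic in PyTorch

# JAX: arrays are immutable and randomness requires an explicit key.
key = jax.random.PRNGKey(0)
x_j = jax.random.normal(key, (3,))
x_j = x_j.at[0].set(1.0)  # functional update; x_j[0] = 1.0 would raise an error

A naive line-by-line translation that keeps the in-place assignment fails at runtime in JAX, which is exactly the class of error a developer-curated fix history can teach a lightweight LLM to avoid.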
Supplementary Material: zip
Primary Area: generative models
Submission Number: 22903