T2J: Leveraging Developer Bug-Fixing Behaviors to Evaluate and Improve LLM-Based PyTorch-to-JAX Translation
Keywords: Code Translation, Large Language Model, In-Context Learning
TL;DR: This work proposes T2J, an in-context learning approach that improves the ability of LLMs to translate PyTorch code to JAX
Abstract: While Large Language Models (LLMs) have shown strong performance in code-to-code translation for widely used programming languages, their application to PyTorch-to-JAX translation remains challenging. Although both frameworks are implemented in Python, they differ fundamentally in design principles, execution models, and ecosystem maturity, with JAX being relatively new and underrepresented in public code repositories. Moreover, the lack of parallel PyTorch-JAX datasets and the limitations of existing evaluation metrics hinder effective cross-framework translation. In this work, we propose \ourtool, a prompt augmentation framework for improving LLM-based PyTorch-to-JAX translation. First, we construct two PyTorch code datasets, a problem-solving code dataset collected from the \textit{TorchLeet} repository \citep{torchleet2025} and a GitHub code dataset collected from the \textit{CodeParrot} benchmark \citep{codeparrot2022}, and use the low-cost GPT-4o-mini to generate initial translations. Second, we employ two professional developers to iteratively fix the generated JAX code until it is functionally equivalent to the original PyTorch code, yielding a curated \textit{fixed-bug dataset} that captures common translation errors and their corresponding fixes. Third, we design augmented prompts that incorporate structured guidance from the fixed-bug dataset to improve the translation quality of lightweight LLMs such as GPT-4o-mini. Finally, we leverage LLM-as-a-judge evaluation and LLM-based estimation of the effort of each bug-fixing step to propose three evaluation metrics for PyTorch-to-JAX code translation: \ourtoolnospace\_CodeTrans\_Score, \ourtoolnospace\_FixCost\_Score, and \ourtoolnospace\_Comparison\_Score. Our results demonstrate that \ourtool significantly improves GPT-4o-mini performance by up to \textbf{10\%} in CodeBLEU, \textbf{50\%} in \ourtoolnospace\_FixCost\_Score, \textbf{1.33 points} in \ourtoolnospace\_CodeTrans\_Score (on a 0-4 scale), and \textbf{100\%} in \ourtoolnospace\_Comparison\_Score. Code generated by \ourtool also runs up to 2.5$\times$ faster than the baseline's output. The replication package is available at: \url{https://tinyurl.com/yradutma}.
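To make the translation challenge concrete, here is a minimal illustrative sketch (not an example from the submission) of the kind of PyTorch-to-JAX pitfall that a fixed-bug dataset like the one described above would capture, assuming only the standard torch and jax APIs: in-place tensor mutation and implicit global randomness in PyTorch must be rewritten as functional updates and explicit PRNG keys in JAX.

import torch
import jax
import jax.numpy as jnp

# PyTorch: tensors are mutable and randomness is driven by global state.
torch.manual_seed(0)
x_t = torch.randn(3)
x_t[0] = 1.0  # in-place assignment is idiomatic in PyTorch

# JAX: arrays are immutable and randomness requires an explicit key.
key = jax.random.PRNGKey(0)
x_j = jax.random.normal(key, (3,))
x_j = x_j.at[0].set(1.0)  # functional update; x_j[0] = 1.0 would raise an error

A naive line-by-line translation that keeps the in-place assignment fails at runtime in JAX, which is exactly the class of error a developer-curated fix history can teach a lightweight LLM to avoid.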
Supplementary Material: zip
Primary Area: generative models
Submission Number: 22903