Abstract: Large language models (LLMs) have demonstrated remarkable proficiency across a wide range of software engineering tasks, yet their ability to perform code migration, that is, adapting code to different environments, remains underexplored. In this work, we propose \OurDATA{}: \underline{\textbf{Code}} \underline{\textbf{M}}igration Across \underline{\textbf{Env}}ironment, a benchmark designed to evaluate how well LLMs handle code migration. The benchmark comprises 922 data points spanning 19 Python and Java packages and offers three tasks for systematically evaluating code migration: identifying version-incompatible functions, determining how those functions changed, and adapting code to the target environment.
Experimental evaluation of \OurDATA{} across seven LLMs revealed an average pass@1 rate of 26.50\%, with \textsc{GPT-4o} performing best at 43.84\%.
We highlight our key findings as follows: (i) LLMs are more familiar with newer function versions, making them better at migrating legacy code, and (ii) LLMs sometimes exhibit a logical inconsistency, identifying function changes that are irrelevant to the target migration environment.
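To make the migration setting concrete, the sketch below illustrates the kind of version-incompatible change the three tasks target. The pandas example is illustrative only and is not drawn from the benchmark itself; it assumes a migration from a pandas version before 2.0 (where `DataFrame.append` was still available) to pandas 2.0 or later (where it was removed).

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
row = pd.DataFrame({"a": [3]})

# Legacy code (pandas < 2.0): DataFrame.append was available.
# df = df.append(row, ignore_index=True)

# Migrated code (pandas >= 2.0): DataFrame.append was removed,
# so the call is adapted to use pd.concat instead.
df = pd.concat([df, row], ignore_index=True)
```

In terms of the benchmark's three tasks, a model would need to (1) flag `DataFrame.append` as version-incompatible, (2) determine that it was removed in favor of `pd.concat`, and (3) rewrite the call accordingly.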
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 1412