Abstract: Code translation aims to translate a piece of code from a source language into a target language.
It is widely used in different software development scenarios such as software migration, multilingual development, and system refactoring.
With the rapid advancement of Large Language Models (LLMs), researchers have begun applying them to code translation.
However, the scarcity of parallel corpora hinders models from learning semantic and syntactic alignment knowledge across programming languages.
To address this issue, we propose a data augmentation method that leverages LLMs to automatically generate snippet-alignment data,
which can provide more fine-grained syntactic alignment knowledge than program-alignment data.
In addition, we explore two training approaches that consistently enhance model performance by leveraging snippet-alignment data.
Experiments on the widely used programming languages Python, Java, and C++ demonstrate that
our augmented snippet-alignment data and training approaches can lead to further performance improvements compared to fine-tuning only on program-alignment data.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: code models, fine-tuning
Contribution Types: Data resources, Data analysis
Languages Studied: Python, Java, C++
Submission Number: 6855