Abstract: Code translation aims to translate a piece of code from a source language into a target language.
It is widely used in different software development scenarios such as software migration, multilingual development, and system refactoring.
With the rapid advancement of Large Language Models (LLMs), researchers have begun applying them to code translation.
However, the scarcity of parallel corpora hinders models from learning semantic and syntactic alignment knowledge across programming languages.
To address this issue, we propose a data augmentation method that leverages LLMs to automatically generate snippet-alignment data,
which can provide more fine-grained syntactic alignment knowledge than program-alignment data.
In addition, we explore two training approaches that consistently enhance model performance by leveraging snippet-alignment data.
Experiments on the widely used programming languages Python, Java, and C++ demonstrate that
our augmented snippet-alignment data and training approaches can lead to further performance improvements compared to fine-tuning only on program-alignment data.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: code models, fine-tuning
Contribution Types: Data resources, Data analysis
Languages Studied: Python, Java, C++
Submission Number: 6855