Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission · Readers: Everyone
Abstract: Back-translation is widely known for its effectiveness in neural machine translation when little to no parallel data is available. In this approach, a source-to-target model is coupled with a target-to-source model and the two are trained in parallel: the target-to-source model generates noisy sources, and the source-to-target model is trained to reconstruct the targets, and vice versa. Recently developed multilingual pre-trained sequence-to-sequence models for programming languages have proven very effective for a broad spectrum of downstream software engineering tasks. It is therefore compelling to use them to build programming language translation systems via back-translation. However, these models cannot be further trained via back-translation, since during pre-training they learn to output sequences in the same language as the inputs. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate a natural language (NL) summary given a piece of code; in code generation, the model learns to do the opposite. Target-to-source generation in back-translation can thus be viewed as target-to-NL-to-source generation. We take advantage of labeled data for the code summarization task. We show that our proposed framework performs comparably to, and in some cases exceeds, state-of-the-art methods on translation between Java and Python.
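
To make the target-to-NL-to-source idea concrete, the sketch below outlines one back-translation step under this framework, assuming a pre-trained sequence-to-sequence model that exposes a summarization head (code to NL) and a generation head (NL to code). All names here (CodeSeq2Seq, summarize, generate, train_step) are hypothetical placeholders for illustration, not the paper's actual interface.

    from typing import List

    class CodeSeq2Seq:
        """Hypothetical pre-trained seq2seq model with a code summarization
        head (code -> NL) and a code generation head (NL -> code)."""

        def summarize(self, code: str, lang: str) -> str:
            """Produce a natural-language summary of `code`."""
            raise NotImplementedError

        def generate(self, summary: str, lang: str) -> str:
            """Produce code in `lang` from a natural-language summary."""
            raise NotImplementedError

        def train_step(self, source: str, target: str) -> None:
            """One supervised update: learn to reconstruct `target` from `source`."""
            raise NotImplementedError

    def back_translation_step(model: CodeSeq2Seq,
                              java_snippets: List[str],
                              python_snippets: List[str]) -> None:
        # Java -> NL -> Python yields a noisy Python "source"; the model is
        # then trained to reconstruct the original Java from it.
        for java_code in java_snippets:
            summary = model.summarize(java_code, lang="java")
            noisy_python = model.generate(summary, lang="python")
            model.train_step(source=noisy_python, target=java_code)

        # The symmetric direction: Python -> NL -> Java, reconstruct Python.
        for python_code in python_snippets:
            summary = model.summarize(python_code, lang="python")
            noisy_java = model.generate(summary, lang="java")
            model.train_step(source=noisy_java, target=python_code)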
Paper Type: long