Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages

Anonymous

08 Mar 2022 (modified: 05 May 2023) · NAACL 2022 Conference Blind Submission · Readers: Everyone
Paper Link: https://openreview.net/forum?id=sfZvEKKUHss
Paper Type: Long paper (up to eight pages of content + unlimited references and appendices)
Abstract: Back-translation is widely known for its effectiveness in neural machine translation when little to no parallel data is available. In this approach, a source-to-target model is coupled with a target-to-source model and the two are trained in parallel: the target-to-source model generates noisy sources, from which the source-to-target model learns to reconstruct the targets, and vice versa. Recent developments in multilingual pre-trained sequence-to-sequence models for programming languages have proven effective for a broad spectrum of downstream software engineering tasks, so it is compelling to train them as programming language translation systems via back-translation. However, these models cannot be further trained via back-translation, since during pre-training they learn to output sequences in the same language as their inputs. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate a natural language (NL) summary given a piece of code; in code generation, the model learns to do the opposite. Target-to-source generation in back-translation can therefore be viewed as target-to-NL-to-source generation. We take advantage of labeled data for the code summarization task. We show that our proposed framework performs comparably to, and in some cases exceeds, state-of-the-art methods on translation between Java and Python.
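
To make the described training loop concrete, below is a minimal sketch of one summarize-and-generate back-translation step in PyTorch. The checkpoint, task prefixes, generation lengths, and learning rate are illustrative assumptions (any pre-trained code sequence-to-sequence model could stand in), not the authors' released implementation.

```python
# Minimal sketch of one summarize-and-generate back-translation step.
# Checkpoint, task prefixes, and hyperparameters are illustrative
# assumptions, not the paper's actual implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")  # assumed checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)


def back_translation_step(python_code: str) -> torch.Tensor:
    """One Java->Python training step via summarize-and-generate.

    The target-side Python snippet is summarized to NL, the NL summary
    is used to generate a noisy Java source, and the model is trained to
    reconstruct the original Python from that synthetic Java.
    """
    # 1) Summarize: Python code -> NL summary (data generation, no gradients).
    with torch.no_grad():
        nl_ids = model.generate(
            **tokenizer("summarize Python: " + python_code, return_tensors="pt"),
            max_length=64,
        )
        summary = tokenizer.decode(nl_ids[0], skip_special_tokens=True)

        # 2) Generate: NL summary -> noisy Java source.
        java_ids = model.generate(
            **tokenizer("generate Java: " + summary, return_tensors="pt"),
            max_length=256,
        )
        noisy_java = tokenizer.decode(java_ids[0], skip_special_tokens=True)

    # 3) Reconstruct: train the model to translate the noisy Java back to
    #    the original Python (the standard back-translation objective).
    inputs = tokenizer("translate Java to Python: " + noisy_java,
                       return_tensors="pt", truncation=True)
    labels = tokenizer(python_code, return_tensors="pt",
                       truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

In the full framework, an analogous step would run in the opposite direction (Java summarized to NL, then generated as Python), and per the abstract, the summarization and generation components can first be fine-tuned on labeled code-summary pairs before back-translation begins.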