Language-Agnostic Representation Learning of Source Code from Structure and Context

Sep 28, 2020 (edited Feb 10, 2022) | ICLR 2021 Poster | Readers: Everyone
  • Keywords: machine learning for code, code summarization
  • Abstract: Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.
  • One-sentence Summary: Language-agnostic learning from Structure and Context of programs improves learning.
  • Supplementary Material: zip
  • Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
  • Code: [danielzuegner/code-transformer](https://github.com/danielzuegner/code-transformer) + [1 community implementation](https://paperswithcode.com/paper/?openreview=Xh5eMZVONGF)
  • Data: [CodeSearchNet](https://paperswithcode.com/dataset/codesearchnet)
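
The abstract describes source code (Context) and its AST (Structure) as two views of the same program, and mentions features that can be computed directly from the AST. The following is a minimal sketch of that idea, not the authors' implementation: it uses Python's standard `ast` and `tokenize` modules, and the helper names (`context_tokens`, `ast_graph`, `shortest_path_lengths`) as well as the choice of shortest-path distance as the structural feature are assumptions made for illustration.

```python
import ast
import io
import tokenize
from collections import deque


def context_tokens(source):
    """Context view: the non-whitespace lexical tokens of the source text."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.string.strip()]


def ast_graph(source):
    """Structure view: AST node labels plus an undirected adjacency list.

    Each visited node gets a fresh index, so the result is always a tree
    even if the parser reuses objects (e.g. for Load/Store contexts).
    """
    tree = ast.parse(source)
    labels, adj = [], []
    stack = [(tree, None)]
    while stack:
        node, parent = stack.pop()
        idx = len(labels)
        labels.append(type(node).__name__)
        adj.append([])
        if parent is not None:
            adj[parent].append(idx)
            adj[idx].append(parent)
        for child in ast.iter_child_nodes(node):
            stack.append((child, idx))
    return labels, adj


def shortest_path_lengths(adj, start):
    """Breadth-first search: hop distance from `start` to every AST node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


source = "def add(a, b):\n    return a + b\n"
tokens = context_tokens(source)      # Context
labels, adj = ast_graph(source)      # Structure
dist = shortest_path_lengths(adj, 0) # distances from the root node
print(tokens)
print([(labels[i], d) for i, d in sorted(dist.items())])
```

The distance computation only needs node adjacency, not any Python-specific node semantics, so the same function would apply to an AST produced by a parser for another language. That is the sense in which such a feature is language-agnostic.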
