Language-Agnostic Representation Learning of Source Code from Structure and ContextDownload PDF


28 Sep 2020 (modified: 20 Nov 2020)ICLR 2021 Conference Blind SubmissionReaders: Everyone
  • Keywords: machine learning for code, code summarization
  • Abstract: Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.
  • Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
  • One-sentence Summary: Language-agnostic learning from Structure and Context of programs improves learning.
  • Supplementary Material: zip
13 Replies