The Effectiveness of Pre-Trained Code Embeddings

Ben Trevett; Donald Reay; N. K. Taylor

27 Sept 2018 (modified: 05 May 2023), ICLR 2019 Conference Blind Submission, Readers: Everyone

Abstract: Word embeddings are widely used in machine-learning-based natural language processing systems. It is common to use pre-trained word embeddings, which provide benefits such as reduced training time and improved overall performance. There has been recent interest in applying natural language processing techniques to programming languages. However, none of this recent work uses pre-trained embeddings on code tokens. Using extreme summarization as the downstream task, we show that pre-trained embeddings on code tokens provide the same benefits as they do for natural languages: a speedup of over 1.9x, a 5% improvement in test loss, a 4% improvement in F1 score, and resistance to over-fitting. We also show that the language used to pre-train the embeddings does not have to match the language of the task to achieve these benefits, and that even embeddings pre-trained on human languages provide these benefits to programming languages.

Keywords: machine learning, deep learning, summarization, embeddings, word embeddings, source code, programming languages, programming language processing

TL;DR: Researchers applying natural language processing techniques to source code are not using any form of pre-trained embeddings; we show that they should be.
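
To make the idea in the abstract concrete, the following is a minimal sketch of how an embedding table might be initialized from pre-trained vectors before training a downstream model. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_embedding_matrix`, the toy vocabulary, the 3-dimensional vectors, and the uniform fallback range are all hypothetical.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return (matrix, hits): one embedding row per token in `vocab`.

    Tokens found in `pretrained` copy their pre-trained vector. All other
    tokens get small random values, which is what every token would get in
    a randomly initialized baseline. The fallback range is an assumption.
    """
    rng = random.Random(seed)
    matrix = []
    hits = 0
    for token in vocab:
        vector = pretrained.get(token)
        if vector is not None and len(vector) == dim:
            matrix.append(list(vector))  # copy so the source is not aliased
            hits += 1
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits


# Toy, made-up data: hypothetical code sub-tokens and 3-dimensional vectors.
vocab = ["get", "name", "index", "<unk>"]
pretrained = {"get": [0.1, 0.2, 0.3], "name": [0.4, 0.5, 0.6]}

matrix, hits = build_embedding_matrix(vocab, pretrained, dim=3)
print(f"{hits}/{len(vocab)} tokens initialized from pre-trained vectors")
```

In a real system the resulting matrix would be loaded into the model's embedding layer, and the pre-trained vectors could come from code in the task's language, code in a different language, or natural-language text, which are the three settings the abstract compares.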
