Pre-trained Contextual Embedding of Source Code

Aditya Kanade; Petros Maniatis; Gogul Balakrishnan; Kensen Shi

25 Sept 2019 (modified: 15 Jun 2025) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Abstract: The source code of a program serves not only as a formal description of an executable task but also as a means of communicating developer intent in human-readable form. To facilitate this, developers use meaningful identifier names and natural-language documentation. This makes it possible to successfully apply sequence-modeling approaches, shown to be effective in natural-language processing, to source code. A major advancement in natural-language understanding has been the use of pre-trained token embeddings; BERT and other works have further shown that pre-trained contextual embeddings can be extremely powerful and can be finetuned effectively for a variety of downstream supervised tasks. Inspired by these developments, we present the first attempt to replicate this success on source code. We curate a massive corpus of Python programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT (CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark of five classification tasks and compare finetuned CuBERT against sequence models trained with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the baseline methods by a margin of 2.9-22%. CuBERT also remains superior when finetuned on smaller datasets and for fewer epochs.
