Contrastive Code Representation Learning

28 Sept 2020 (modified: 22 Oct 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: programming languages, representation learning, contrastive learning, unsupervised learning, self-supervised learning, transfer learning, nlp, pretraining, type inference, summarization
Abstract: Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, only the raw text of programs. ContraCode optimizes for a representation that is invariant to semantic-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines and top-1 accuracy of type inference baselines by 2% to 13%. ContraCode achieves 9% higher top-1 accuracy than the current state-of-the-art static type analyzer for TypeScript. Finally, representations learned through a hybrid contrastive and reconstruction objective transfer in zero-shot to code clone detection with +10% AUROC over a static text similarity measure and +5% over reconstruction alone.
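The abstract describes a standard contrastive setup: each program and a compiler-generated variant form a positive pair, and the other programs in the batch serve as negatives. Below is a minimal sketch of such an InfoNCE-style objective. The shapes, temperature value, and the assumption of paired anchor/variant embeddings are illustrative, not the paper's exact implementation; the encoder and the source-to-source augmentation pipeline are omitted.

```python
# Minimal sketch of a batched InfoNCE contrastive loss over program embeddings.
# Assumes anchor_emb[i] and positive_emb[i] encode two semantically equivalent
# variants of the same program (a hypothetical setup mirroring the abstract).
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb: torch.Tensor,
                     positive_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Each anchor must identify its own transformed variant among the
    other programs in the batch, which act as non-equivalent negatives."""
    anchor = F.normalize(anchor_emb, dim=1)       # (B, D) anchor embeddings
    positive = F.normalize(positive_emb, dim=1)   # (B, D) variant embeddings
    logits = anchor @ positive.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)       # diagonal = positive pairs
```

Minimizing this loss pushes representations of equivalent programs together and non-equivalent programs apart, so the network must encode functionality rather than surface form.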
One-sentence Summary: We learn to represent programs in a self-supervised manner by using compiler transformations as data augmentations.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2007.04973/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=NLx5IwJTNc
