DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Marie-anne Lachaux; Baptiste Roziere; Marc Szafraniec; Guillaume Lample

Published: 09 Nov 2021, Last Modified: 05 May 2023, Venue: NeurIPS 2021 Poster, Readers: Everyone

Keywords: code, obfuscation, programming languages, pre-training, deobfuscation, ML for code, ML for Programming Languages

TL;DR: We propose a new objective leveraging the structural aspect of programming languages. It outperforms existing approaches on multiple downstream tasks and is also able to deobfuscate fully obfuscated files and suggest descriptive variable names.

Abstract: Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
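To make the objective concrete, the following is a minimal sketch of how a training pair for a deobfuscation task could be built from a Python function. It is an illustration under stated assumptions, not the authors' implementation: the placeholder names (`FUNC_i`, `VAR_i`), the `placeholder name | placeholder name` target format, and the helpers `obfuscate` and `target_string` are hypothetical choices made here. The model input would be the obfuscated code, and the training target would be the mapping back to the original names.

```python
import ast  # ast.unparse requires Python 3.9+


class _Collector(ast.NodeVisitor):
    """Collect names defined in the file, in depth-first source order."""

    def __init__(self):
        self.mapping = {}  # original name -> placeholder (hypothetical scheme)
        self.counts = {"FUNC": 0, "VAR": 0}

    def _add(self, name, kind):
        if name not in self.mapping:
            self.mapping[name] = f"{kind}_{self.counts[kind]}"
            self.counts[kind] += 1

    def visit_FunctionDef(self, node):
        self._add(node.name, "FUNC")
        self.generic_visit(node)

    def visit_arg(self, node):
        self._add(node.arg, "VAR")

    def visit_Name(self, node):
        # Only names that are assigned to count as user-defined variables;
        # names that are only read (e.g. builtins such as `sum`) are left alone.
        if isinstance(node.ctx, ast.Store):
            self._add(node.id, "VAR")


class _Renamer(ast.NodeTransformer):
    """Rewrite every collected name to its placeholder."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def obfuscate(source):
    """Return (obfuscated_code, mapping) for a Python source string."""
    tree = ast.parse(source)
    collector = _Collector()
    collector.visit(tree)
    renamed = _Renamer(collector.mapping).visit(tree)
    return ast.unparse(renamed), collector.mapping


def target_string(mapping):
    """Serialize the mapping as the sequence a model would be asked to predict."""
    return " | ".join(f"{ph} {name}" for name, ph in mapping.items())


if __name__ == "__main__":
    original = (
        "def add_total(prices, tax):\n"
        "    total = sum(prices)\n"
        "    return total * (1 + tax)\n"
    )
    code, mapping = obfuscate(original)
    print(code)                    # model input: obfuscated source
    print(target_string(mapping))  # training target: original names
```

This toy version handles only function names, arguments, and assigned variables; it leaves classes, attributes, keyword arguments at call sites, and imports untouched. Recovering `add_total` from `FUNC_0` requires the model to understand what the function body computes, which is the intuition behind using deobfuscation as a pre-training signal.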

Supplementary Material: pdf

Code: https://github.com/facebookresearch/CodeGen
