GenTAL: Generative Denoising Skip-gram Transformer for Unsupervised Binary Code Similarity Detection

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: Representation Learning, Transformer, Autoencoder, Binary Code Similarity Detection
Abstract: Binary code similarity detection plays a critical role in cybersecurity. It alleviates the huge manual effort required in the reverse engineering process for malware analysis and vulnerability detection, where the original source code is often unavailable. Most existing solutions rely on manual feature engineering and customized code-matching algorithms, which are inefficient and inaccurate. Recent deep-learning-based solutions embed the semantics of binary code into a latent space through supervised contrastive learning. However, no training set can cover all the possible forms the same semantics may take, so such methods struggle to learn this variation. In this paper, we propose an unsupervised model that learns an intrinsic representation of assembly code semantics. Specifically, we design a Transformer-based, auto-encoder-like language model over the low-level assembly code grammar to capture an abstract semantic representation. By coupling a Transformer encoder with a skip-gram-style loss, it learns a compact representation that is robust to different compilation options. We conduct experiments on four block-level code similarity tasks; the results show that our method is more robust than the state-of-the-art.
One-sentence Summary: This paper proposes a novel unsupervised Transformer-based language model for code semantic representation learning.
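
To make the coupling described in the abstract concrete, here is a minimal PyTorch sketch of a Transformer encoder paired with a skip-gram-style negative-sampling loss. This is an illustrative assumption, not the authors' implementation: the class name `GenTALSketch`, the methods `encode` and `skipgram_loss`, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenTALSketch(nn.Module):
    """Hypothetical sketch: a Transformer encoder pools a (possibly noised)
    assembly block into one vector, which is trained skip-gram style to
    predict the tokens of the original, un-noised block."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Separate "context" embedding table, as in word2vec skip-gram.
        self.ctx_emb = nn.Embedding(vocab_size, d_model)

    def encode(self, tokens):
        # tokens: (batch, seq) assembly-token ids; a denoising variant would
        # replace some ids with a [MASK] id before calling this.
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok_emb(tokens) + self.pos_emb(pos))
        return h.mean(dim=1)  # compact block-level representation

    def skipgram_loss(self, tokens, targets, negatives):
        # targets:   (batch, k)        tokens of the original block
        # negatives: (batch, k, n_neg) tokens sampled from the vocabulary
        z = self.encode(tokens)                                      # (batch, d)
        pos_score = (self.ctx_emb(targets) * z[:, None, :]).sum(-1)  # (batch, k)
        neg_score = (self.ctx_emb(negatives) * z[:, None, None, :]).sum(-1)
        # Maximize scores of true tokens, minimize scores of negatives.
        return (-F.logsigmoid(pos_score).mean()
                - F.logsigmoid(-neg_score).mean())
```

At similarity-detection time only `encode` would be needed: under this design, compilation variants of the same source block should map to nearby vectors (e.g., by cosine similarity), and the skip-gram head is discarded after training.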
