Abstract: In this paper, we propose a novel unsupervised approach for code similarity and clone detection that is based on Graph Neural Networks. We propose a hybrid approach to detect similarities within source code, using centroid distances and a Graph Auto-Encoder that uses a raw abstract syntax trees as input. When compared to \(\mathrm {R_{TV}NN}\) [33], the state-of-the-art unsupervised approach for code similarity and clone detection, our method improves significantly training and inference time efficiency, while preserving or improving precision. In our experiments, our algorithm is on average 77 times faster during training and 21 times faster during inference. This shows that using Graph Auto-Encoders in the domain of source code similarity analysis is the better option in an industrial context or in a production environment. We illustrate this by using our approach to compute source code similarity within a large dataset of phishing kits written in PHP provided by our industry partner.
Loading