CrossCode2Vec: A unified representation across source and binary functions for code similarity detection
Abstract: Code similarity detection identifies similar code by analyzing syntax, semantics, and structure, and covers three types of tasks: source-to-source, binary-to-binary, and source-to-binary. Because source and binary code differ in encoding and representation, existing methods have mainly focused on individual tasks and do not provide a universal solution. Additionally, current source-to-binary methods only achieve one-to-one matching between source code and binary functions, neglecting the one-to-many relationship inherent between source code and its cross-compiled binaries. In this paper, we propose CrossCode2Vec, a unified framework for representing both source and binary functions, which aims to bridge the gap between their original coding features and provide a standardized similarity measurement across the three code similarity detection tasks. For source code and its corresponding compiled binary, we first design an enhanced Abstract Path Context data preprocessing method, construct an abstract syntax tree (AST) from both source code functions and decompiled binary functions, and compute function embeddings with a pre-trained Word2vec model. We then propose a task-specific data sampling strategy: we establish a one-to-one correspondence between source and binary functions through symbol tables, and create a one-to-many relationship between source functions and their cross-compiled binaries based on sampling rules. Finally, we employ a hierarchical LSTM-attention network to facilitate the representation and similarity measurement of functions. We conduct both extrinsic and intrinsic evaluations to confirm the effectiveness of CrossCode2Vec in code representation and code similarity tasks, validating the advantages of its model architecture and data processing methods.
CrossCode2Vec demonstrates stable and exceptional performance across multiple experiments, reinforcing its ability to bridge the gap between source and binary code representations while effectively measuring their similarities.
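The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each cross-compiled binary exposes a symbol table as a set of function names, and that a source function is matched to a binary function when the names are equal. The function `build_pairs`, the build labels, and the example function names are all hypothetical.

```python
from collections import defaultdict


def build_pairs(source_functions, binaries):
    """Pair source functions with binary functions by symbol name.

    source_functions: iterable of function names found in the source code.
    binaries: dict mapping a build label (hypothetical, e.g. "gcc-O0-x86")
              to the set of function names in that binary's symbol table.

    Returns (one_to_one, one_to_many):
      one_to_one  - list of (source_name, (build_label, symbol_name)) pairs
      one_to_many - dict: source_name -> list of (build_label, symbol_name)
    """
    one_to_many = defaultdict(list)
    # Sort build labels so the output order is deterministic.
    for build, symbols in sorted(binaries.items()):
        for name in source_functions:
            # Assumption: a name match in the symbol table identifies the
            # compiled counterpart of the source function.
            if name in symbols:
                one_to_many[name].append((build, name))

    # Flatten the one-to-many mapping into individual one-to-one pairs.
    one_to_one = [
        (src, match)
        for src, matches in one_to_many.items()
        for match in matches
    ]
    return one_to_one, dict(one_to_many)


# Hypothetical example: two builds of the same project.
example_sources = ["parse", "hash", "unused"]
example_binaries = {
    "gcc-O0-x86": {"parse", "hash"},
    "clang-O2-arm": {"parse"},
}
pairs, groups = build_pairs(example_sources, example_binaries)
```

In this sketch, a source function that appears in several builds yields one group with several binary counterparts, which is the one-to-many relationship the abstract refers to; functions absent from every symbol table are simply left out.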
External IDs: dblp:journals/ijon/YuALHFCS25