Abstract: Binary code search is critical for applications such as plagiarism detection and security analysis, but it is challenging because compilers transform code differently at each optimization level. Existing function similarity methods often fail in large-scale search scenarios; pairwise approaches in particular scale poorly. To address this, we propose BASSET, a framework that leverages multilevel hybrid semantic features for efficient large-scale binary function clone search. BASSET decomposes functions into five semantic units and applies various embedding strategies to generate indexing vectors for similarity measurement. Notably, it integrates an expression-tree-based representation to capture features that remain robust across compiler optimization levels. Using a learning-to-rank approach with convolutional neural networks, BASSET combines the similarity scores from the different semantic units into a final ranking. Experimental results show that BASSET outperforms existing methods, achieving an AUC of 0.992, an nDCG@10 of 0.853, and an MRR that remains stable at 59% even as the search space grows.
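For readers unfamiliar with the ranking metrics quoted above, the following is a minimal sketch of how MRR and nDCG@10 are conventionally computed for a clone search query. It assumes binary relevance (a retrieved function is either a true clone or not); the function names and the toy identifiers are hypothetical and are not taken from BASSET's evaluation code.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, func_id in enumerate(ranked_ids, start=1):
        if func_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains (assumption: every true clone has gain 1)."""
    # Discounted cumulative gain of the top-k results actually returned.
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, func_id in enumerate(ranked_ids[:k], start=1)
        if func_id in relevant_ids
    )
    # Ideal DCG: all relevant items placed at the top of the list.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy usage: two queries over hypothetical function identifiers.
queries = [
    (["f1", "f2", "f3"], {"f1"}),  # true clone ranked first
    (["f4", "f5", "f6"], {"f5"}),  # true clone ranked second
]
mrr = mean_reciprocal_rank(queries)
```

Under this convention, an MRR of 59% would correspond to a mean reciprocal rank of 0.59, meaning the first true clone typically appears within the top two results.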
External IDs: dblp:conf/dsn/ZhaoXYWCZL25