CRABS-former: CRoss-Architecture Binary Code Similarity Detection based on Transformer

Published: 01 Jan 2024 · Last Modified: 09 Aug 2024 · Internetware 2024 · CC BY-SA 4.0
Abstract: Binary code similarity detection (BCSD) is widely used in software analysis tasks such as vulnerability detection and malware identification. Among the various forms of binary representation, assembly is particularly practical for real-world applications because it can be preprocessed more efficiently than graph or intermediate representation (IR) forms. Existing assembly-based methods leverage the text embedding capabilities of pretrained language models such as BERT, but they still face limitations in cross-architecture BCSD due to the characteristics of assembly code and the lack of a cross-architecture vocabulary. In this paper, we first design several normalization strategies to preprocess assembly code from multiple instruction set architectures (ISAs), reducing the token length of assembly inputs and the size of the vocabulary, thereby improving processing efficiency and simplifying the model structure. Then, we propose a method to collect token instances and construct a tokenizer capable of processing assembly code from multiple ISAs, enhancing the model's ability to interpret such code. Based on this tokenizer, we develop a CRoss-Architecture Binary code Similarity detection model based on the Transformer (CRABS-former). CRABS-former compares two binary functions from different ISAs, compilers, or optimization options and computes their similarity score. Finally, we evaluate CRABS-former on two BCSD tasks (one-to-one and one-to-many) against four baselines: SAFE, Trex, jTrans, and TE3L. The results indicate that CRABS-former, with a pool size of 10,000, improves recall by 10.85%, 18.02%, and 3.33% across different ISAs, compilers, and optimizations, respectively, underscoring the effectiveness of our approach.
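To make the normalization idea concrete, the sketch below shows the kind of preprocessing the abstract describes: collapsing immediates and long addresses into placeholder tokens so that functions compiled for different ISAs share a small common vocabulary. The placeholder names (ADDR, IMM), the regular expressions, and the helper functions are illustrative assumptions, not the paper's actual scheme.

```python
import re
from collections import Counter

# Illustrative normalization pass (not the paper's actual rules):
# long hex constants are treated as addresses, numeric literals as immediates.
HEX_ADDR = re.compile(r"0x[0-9a-fA-F]{5,}")
IMMEDIATE = re.compile(r"(?<!\w)[#$]?-?\d+\b")

def normalize_instruction(ins: str) -> list[str]:
    """Split one assembly instruction into normalized tokens."""
    ins = HEX_ADDR.sub("ADDR", ins)
    ins = IMMEDIATE.sub("IMM", ins)
    # Split on whitespace and commas; keep brackets/parens as their own tokens.
    return [t for t in re.split(r"[\s,]+|([\[\]()])", ins) if t]

def normalize_function(asm_lines: list[str]) -> list[str]:
    """Flatten a whole binary function into one normalized token sequence."""
    tokens: list[str] = []
    for line in asm_lines:
        tokens.extend(normalize_instruction(line))
    return tokens

# The same logic compiled for x86-64 and AArch64 (hand-written examples):
x86 = ["mov eax, 0x4005d0", "add eax, 1", "ret"]
arm = ["ldr w0, 0x4005d0", "add w0, w0, #1", "ret"]

print(normalize_function(x86))
# ['mov', 'eax', 'ADDR', 'add', 'eax', 'IMM', 'ret']
print(normalize_function(arm))
# ['ldr', 'w0', 'ADDR', 'add', 'w0', 'w0', 'IMM', 'ret']

# A shared vocabulary over both ISAs stays small because literals collapse
# into a handful of placeholders.
vocab = Counter(normalize_function(x86) + normalize_function(arm))
print(sorted(vocab))
```

Under this kind of scheme, the cross-architecture tokenizer only needs entries for mnemonics, registers, and the placeholder tokens, which is what allows the vocabulary (and hence the embedding table) to stay compact across ISAs.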