Transforming Generic Coder LLMs to Effective Binary Code Embedding Models for Similarity Detection

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: cybersecurity, embedding, binary code
Abstract: Cybersecurity and software research have intersected with modern deep learning research in recent years. The power of large language models (LLMs) in particular has motivated us to apply them to understanding binary code. In this paper, we investigate how LLMs can be applied to binary code similarity detection, a task significantly more difficult than source code similarity detection due to the sparsity of information and the less meaningful syntax of binary code. The task also has important practical applications, such as vulnerability and malware detection. We find that pretrained LLMs are largely capable of detecting similar binary code, even in a zero-shot setting. Our main contribution is a set of supervised fine-tuning methods that, when combined, significantly surpass both zero-shot LLMs and state-of-the-art binary code similarity detection methods. Specifically, we up-train the model through data augmentation, translation-style causal learning, LLM2Vec, and a cumulative GTE loss. With a complete ablation study, we show that our training method transforms a generic language model into a powerful binary similarity expert and is robust and general enough for cross-optimization, cross-architecture, and cross-obfuscation detection.
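The abstract does not spell out the form of the GTE-style contrastive objective used for embedding fine-tuning; the sketch below is a minimal, hedged illustration of an in-batch contrastive (InfoNCE-style) loss over paired binary-function embeddings, assuming the paper's "cumulative GTE loss" builds on this kind of objective. The function name, batch pairing, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """Illustrative in-batch contrastive (InfoNCE-style) loss.

    anchor_emb, positive_emb: (batch, dim) embeddings of matched binary
    function pairs, e.g., the same source function compiled at -O0 and -O3.
    For each anchor, the other positives in the batch act as negatives.
    This is a generic sketch, not the paper's exact cumulative GTE loss.
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(a.size(0), device=a.device)   # diagonal = true pairs
    # Symmetric objective: anchor-to-positive and positive-to-anchor directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```

In practice such a loss would be applied to embeddings produced by the LLM2Vec-adapted encoder, with data augmentation (cross-optimization, cross-architecture, cross-obfuscation variants) supplying the positive pairs.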
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 26150