Pre-Training Representations of Binary Code Using Contrastive Learning

Published: 11 Oct 2025, Last Modified: 11 Oct 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Binary code analysis and comprehension are critical to applications in reverse engineering and computer security, where source code is not available. Unfortunately, unlike source code, binary code lacks high-level semantic information and is more difficult for human engineers to understand and analyze. In this paper, we present ContraBin, a contrastive learning technique that integrates source code and comment information along with binaries to create an embedding capable of aiding binary analysis and comprehension tasks. Specifically, ContraBin comprises three components: (1) a primary contrastive learning method for initial pre-training, (2) a simplex interpolation method to integrate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to train a binary code embedding. We further analyze the impact of human-written and synthetic comments on binary code comprehension tasks, revealing a significant performance disparity: while synthetic comments provide substantial benefits, human-written comments introduce noise and can even cause performance drops compared to using no comments at all. These findings reshape the understanding of the role that comment types play in binary code analysis. We evaluate the effectiveness of ContraBin on four indicative downstream tasks related to binary code: algorithmic functionality classification, function name recovery, code summarization, and reverse engineering. The results show that ContraBin considerably improves performance on all four tasks, measured by accuracy, mean average precision, and BLEU scores as appropriate. ContraBin is the first language representation model to incorporate source code, binary code, and comments into contrastive code representation learning, and it is intended to contribute to the field of binary code analysis. The dataset used in this study is available for further research.
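
To make the pre-training objective concrete, below is a minimal PyTorch sketch of the core idea: sampling mixing weights from a probability simplex to interpolate the three modality embeddings, then applying a standard contrastive (InfoNCE) loss. This is an illustrative sketch under simplifying assumptions (dummy embeddings, a Dirichlet sampler, and function names invented for this example), not the implementation from the paper; the full code is available in the Zenodo repository linked below.

```python
# Illustrative sketch only: the exact loss, encoders, and sampling scheme
# in ContraBin may differ. Function names here are invented for this example.
import torch
import torch.nn.functional as F

def simplex_interpolate(src, cmt, binary, alpha=1.0):
    """Mix three modality embeddings with weights drawn from the probability simplex."""
    # One Dirichlet-distributed weight vector per batch element; alpha=1.0 is
    # uniform over the simplex (an assumption made for this sketch).
    lam = torch.distributions.Dirichlet(torch.full((3,), alpha)).sample((src.size(0),))
    lam = lam.to(src.device)
    return lam[:, 0:1] * src + lam[:, 1:2] * cmt + lam[:, 2:3] * binary

def info_nce(anchor, positive, temperature=0.07):
    """InfoNCE loss: diagonal pairs are positives, other in-batch rows are negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                   # (B, B) cosine-similarity logits
    labels = torch.arange(a.size(0), device=a.device)  # positive index for each row
    return F.cross_entropy(logits, labels)

# Usage with dummy embeddings (batch of 8, dimension 256); in practice these
# would come from the source-code, comment, and binary (lifted IR) encoders.
B, D = 8, 256
src_emb, cmt_emb, bin_emb = (torch.randn(B, D, requires_grad=True) for _ in range(3))
anchor = simplex_interpolate(src_emb, cmt_emb, bin_emb)
loss = info_nce(bin_emb, anchor)
loss.backward()  # gradients would flow back into the binary-code encoder
print(f"contrastive loss: {loss.item():.4f}")
```

Sampling the weights from a Dirichlet distribution is one simple way to cover the simplex of source/comment/binary mixtures; pulling the binary embedding toward the interpolated anchor encourages it to share structure with both source code and comments.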
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We would like to sincerely thank the Action Editor and reviewers for their valuable feedback. We have revised the manuscript to address the minor revisions requested in the Action Editor's summary decision. A list of the key changes is provided below:

* **Clarity and Structure in Introduction:** Section 1 has been substantially restructured for improved readability and logical flow. The discussion of the evaluation strategy has been separated from the conceptual insights, and the list of contributions has been condensed from six points to a more focused four. In doing so, we also tightened the novelty claim to clearly emphasize our contributions on LLVM IR and contrastive methods.
* **Clarified Experimental Protocol and Reproducibility:** In Section 4.1, we now explicitly restate that all baselines share an identical fine-tuning protocol (including dataset, splits, and optimizer). In addition, we added a clarification before Section 4.2 confirming that our evaluation adheres to the CodeXGLUE benchmark's fixed random seed (123456), ensuring reproducibility.
* **Improved Figure Captions and Readability:** The caption for Figure 6 has been rewritten to be more self-explanatory: it now explicitly describes what the blue dots and the highlighted case study represent and articulates the main takeaway without requiring readers to consult the main text. References to subfigures have been standardized, and code snippets in Figures 7–8 have been enlarged and relabeled to improve readability.
* **Enhanced Reproducibility Visibility:** We have added a footnote on page 3 with a direct link to our public Zenodo repository, which contains the full implementation, datasets, and pre-trained models.
* **General Polish and Consistency:** We performed a thorough proofread of the manuscript. This included:
  - Standardizing the model name to *ContraBin* throughout.
  - Correcting dataset terminology (e.g., consistently using *DIRE* instead of "DIRT") and clarifying that "binary code" in our context refers to both executables and lifted IR.
  - Ensuring consistent metric naming (e.g., *GLEU-4* instead of "GLUE-4") across all sections and tables.
  - Correcting minor typos (e.g., *algorithmic* instead of "algorithmec," *precision* instead of "prevision").
  - Revising Table 7 with clearer row labels, a consistent baseline reference, and a note contextualizing the performance of human-written comments.
  - Updating all URLs to a consistent `https://` format where available.

We believe these revisions fully address the Action Editor's requested changes on clarity, reproducibility, and novelty emphasis, while also improving overall readability and consistency. We are grateful for the opportunity to refine our work and hope that the current version is now suitable for publication in TMLR. Thank you for your guidance!
Video: https://www.youtube.com/watch?v=_iR7AoYl5bg
Code: https://zenodo.org/records/15219264
Assigned Action Editor: ~Charles_Xu1
Submission Number: 4672