AsmDocGen: Generating Functional Natural Language Descriptions for Assembly Code

Jesia Quader Yuki, Mohammadhossein Amouei, Benjamin C. M. Fung, Philippe Charland, Andrew Walenstein

Published: 2024, Last Modified: 15 May 2025, ICSOFT 2024, CC BY-SA 4.0

Abstract: This study explores the field of software reverse engineering through the lens of code summarization, which involves generating informative and concise summaries of code functionality. A significant aspect of this research is the application of assembly code summarization in malware analysis, highlighting its critical role in understanding and mitigating potential security threats. Although there have been recent efforts to develop code summarization techniques for high-level programming languages, to the best of our knowledge, this study is the first attempt to generate comments for assembly code. For this purpose, we first built a carefully curated dataset of assembly function-comment pairs. We then focused on automatic assembly code summarization using transfer learning with pre-trained natural language processing (NLP) models, including BERT, DistilBERT, RoBERTa, and CodeBERT. The results of our experiments show a notable advantage of CodeBERT: despite its initial training on hi