Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning

ACL ARR 2024 June Submission3686 Authors

16 Jun 2024 (modified: 20 Jul 2024), ACL ARR 2024 June Submission, Readers: Everyone, License: CC BY 4.0

Abstract: Binary code analysis is the foundation of crucial tasks in the security domain; thus, building effective binary analysis techniques is more important than ever. Although large language models (LLMs) have brought impressive improvements to source code tasks, they do not directly generalize to assembly code because of two challenges unique to assembly: (1) its low information density and (2) the diverse optimizations applied to it. To overcome these challenges, this work proposes a hierarchical attention mechanism that builds attention summaries to capture semantics more effectively, and designs contrastive learning objectives that train LLMs to learn assembly optimization. Equipped with these techniques, this work develops Nova, a generative LLM for assembly code. Nova outperforms existing techniques on binary code decompilation by up to 146.54% and outperforms the latest binary code similarity detection techniques by up to 6.17%, showing promising abilities in both assembly generation and understanding tasks.

Paper Type: Long

Research Area: Language Modeling

Research Area Keywords: pre-training, fine-tuning, applications

Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models

Languages Studied: programming language, C, Assembly

Submission Number: 3686
