Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly

ACL ARR 2024 June Submission2852 Authors

15 Jun 2024 (modified: 22 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0

Abstract: Compilers are complex software systems that contain millions of lines of code and take years to develop. This paper investigates to what extent Large Language Models (LLMs) can replace hand-crafted compilers in translating high-level programming languages to machine instructions, using C to x86 assembly as a case study. We identify two challenges of using LLMs for code translation and introduce two novel data pre-processing techniques to address them: numerical value conversion and training data resampling. Despite using only a 13B model, our approach achieves a behavioral accuracy of over 91%, outperforming the much larger GPT-4 Turbo model by over 50%. These results are encouraging and suggest that LLMs have the potential to transform how compilation tools are constructed.

Paper Type: Long

Research Area: Machine Translation

Research Area Keywords: pre-training for MT, few-shot/zero-shot MT, domain adaptation, applications

Contribution Types: NLP engineering experiment, Data analysis

Languages Studied: C programming language, x86 assembly language

Submission Number: 2852
