Introducing Compiler Semantics into Large Language Models as Programming Language Translators: A Case Study of C to x86 Assembly

ACL ARR 2024 June Submission2852 Authors

15 Jun 2024 (modified: 22 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Compilers are complex software systems that contain millions of lines of code and take years to develop. This paper investigates to what extent Large Language Models (LLMs) can replace hand-crafted compilers in translating high-level programming languages to machine instructions, using C to x86 assembly as a case study. We identify two challenges of using LLMs for code translation and introduce two novel data pre-processing techniques to address them: numerical value conversion and training data resampling. Despite using only a 13B model, our approach achieves a behavioral accuracy of over 91%, outperforming the much larger GPT-4 Turbo model by over 50%. Our results are encouraging, showing that LLMs have the potential to transform how compilation tools are constructed.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: pre-training for MT, few-shot/zero-shot MT, domain adaptation, applications
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: C programming language, x86 assembly language
Submission Number: 2852