Statistical Machine Translation Is a Natural Fit for Automatic Identifier Renaming in Software Source Code

Jeremy Lacomis, Alan Jaffe, Edward J. Schwartz, Claire Le Goues, Bogdan Vasilescu

2018 (modified: 02 Mar 2020), AAAI Workshops 2018

Abstract: Advances in natural language processing (NLP) have produced a variety of successful tools and techniques for understanding, generating, and translating natural languages. Given this success, a natural question is whether the same techniques can be applied to programming languages. Initial results, however, have been mixed. Researchers who used statistical machine translation (SMT) to translate between programming languages found that a large percentage of the translated programs were not syntactically valid. On the other hand, SMT has been used successfully to recover identifiers in obfuscated JavaScript code. In this paper, we discuss several differences between natural languages and programming languages that can thwart the application of NLP techniques to program transformation. We also discuss several strategies for coping with these differences in practice, drawing on our own experience using SMT to assign meaningful identifier names to variables in decompiled C programs.
