Abstract: Reverse engineering of binary executables is a critical problem in the computer
security domain. On the one hand, malicious parties may recover interpretable
source code from software products to gain commercial advantages. On the
other hand, binary decompilation can be leveraged for code vulnerability analysis
and malware detection. However, efficient binary decompilation is challenging.
Conventional decompilers have the following major limitations: (i) they are only
applicable to a specific source-target language pair and hence incur undesired
development cost for new language tasks; (ii) their output high-level code cannot effectively
preserve the correct functionality of the input binary; (iii) their output program does
not capture the semantics of the input and the reversed program is hard to interpret.
To address the above problems, we propose Coda, the first end-to-end neural-based
framework for code decompilation. Coda decomposes the decompilation task into
two key phases: First, Coda employs an instruction type-aware encoder and a
tree decoder for generating an abstract syntax tree (AST) with attention feeding
during the code sketch generation stage. Second, Coda updates the code sketch
using an iterative error correction machine guided by an ensembled neural error
predictor. By finding a good approximate candidate and then iteratively correcting
it toward an exact match, Coda achieves superior performance compared to baseline approaches. We
assess Coda’s performance with extensive experiments on various benchmarks.
Evaluation results show that Coda achieves an average of 82% program recovery
accuracy on unseen binary samples, whereas the state-of-the-art decompilers yield 0%
accuracy. Furthermore, Coda outperforms the sequence-to-sequence model with
attention by a margin of 70% in program accuracy. Our work reveals the vulnerability
of binary executables and poses a new threat to the protection of Intellectual
Property (IP) for software development.
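
The two-phase idea described above (produce an approximate candidate, then repair it until it compiles to the input binary) can be illustrated with a toy sketch. This is not Coda's implementation: the "compiler" here is a hypothetical token-to-opcode lookup, the "error predictor" is a plain position-wise comparison standing in for the ensembled neural predictor, and all names (`toy_compile`, `predict_errors`, `iterative_correction`) are assumptions introduced only for illustration.

```python
from typing import List, Optional

# Hypothetical stand-in for a real compiler: maps source tokens to "opcodes".
OPCODES = {"+": "add", "-": "sub", "*": "mul"}


def toy_compile(program: List[str]) -> List[str]:
    """Translate each source token to its toy opcode (identifiers pass through)."""
    return [OPCODES.get(tok, tok) for tok in program]


def mismatch(a: List[str], b: List[str]) -> int:
    """Count positions where two compiled outputs differ."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def predict_errors(program: List[str], target: List[str]) -> List[int]:
    """Toy error predictor: flag positions whose compiled token differs.

    In the paper this role is played by a neural ensemble; here it is a
    direct comparison, which is only possible because the toy compiler
    maps tokens one-to-one.
    """
    compiled = toy_compile(program)
    return [i for i, (x, y) in enumerate(zip(compiled, target)) if x != y]


def iterative_correction(
    sketch: List[str],
    target: List[str],
    candidates: List[str],
    max_iters: int = 10,
) -> Optional[List[str]]:
    """Repair the sketch one token at a time until it compiles to the target."""
    program = list(sketch)
    for _ in range(max_iters):
        current = mismatch(toy_compile(program), target)
        if current == 0:
            return program
        improved = False
        for pos in predict_errors(program, target):
            for tok in candidates:
                trial = program[:pos] + [tok] + program[pos + 1:]
                # Accept an edit only if it brings the output closer to the target.
                if mismatch(toy_compile(trial), target) < current:
                    program, improved = trial, True
                    break
            if improved:
                break
        if not improved:
            return None  # no single-token edit helps; give up
    return program if mismatch(toy_compile(program), target) == 0 else None


# Phase 1 output (assumed): a sketch with two wrong operators.
sketch = ["a", "+", "b", "+", "c"]
# The "binary" to be decompiled, expressed as toy opcodes.
target = ["a", "mul", "b", "sub", "c"]
recovered = iterative_correction(sketch, target, candidates=["+", "-", "*"])
```

The loop accepts an edit only when recompiling the candidate moves its output strictly closer to the input binary, so every accepted step is verified against the target rather than trusted from the predictor alone.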