Keywords: LLM, Reverse Engineering, Function Name Recovery, Stripped Binaries, Data Conversion
TL;DR: Motivated by the limited availability of training data and the challenges LLMs face in function name recovery, we propose SymSem, a semantic-aware, self-transformative fine-tuning framework.
Abstract: Reverse engineers analyze stripped binaries to identify and mitigate software vulnerabilities. Unlike source code, real-world binaries contain little semantic information, as companies often strip symbols to reduce file size and protect intellectual property; this lack of information makes program comprehension challenging. Since binaries consist of numerous functions, recovering meaningful function names is a crucial step toward understanding program behavior. Recent work has applied machine learning to this task, and Large Language Models (LLMs) have shown particular promise because they can generate contextually relevant identifiers beyond the training set. However, progress is hindered by the limited availability of training data, underscoring the need for optimized fine-tuning strategies. To address this, we propose SymSem, a self-transformative, semantic-aware fine-tuning framework for function name recovery. We evaluate SymSem on three architectures (x86-64, ARM, and MIPS) and demonstrate that it significantly outperforms prior approaches, achieving up to a 68% higher F1 score than the state of the art on MIPS.
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18233