Synthetic Malware at Scale: Malicious Code Generation With Code Transplanting

Guangzhan Wang, Diwei Chen, Xiaodong Gu, Yuting Chen, Beijun Shen

Published: 2026, Last Modified: 30 May 2026. IEEE Trans. Software Eng. 2026. CC BY-SA 4.0

Abstract: Malicious code detection is one of the most essential tasks in safeguarding against security breaches, data compromise, and related threats. While machine learning has emerged as a predominant method for pattern detection, training is difficult because malicious code samples are severely scarce. Consequently, machine learning detectors often encounter malicious patterns only in limited and isolated scenarios, which hinders their ability to generalize across diverse threat landscapes. In this paper, we introduce MalCoder, a novel method for synthesizing malicious code samples. MalCoder enlarges the quantity and diversity of malicious instances by transplanting a set of malicious prototypes into a vast pool of benign code, thereby crafting a diverse array of malicious instances tailored to various application scenarios. MalCoder treats each malicious prototype as an incomplete code fragment and crafts its preceding and subsequent contexts through right-to-left and left-to-right code completion, respectively. By leveraging GPTs with various sampling strategies, we can instantiate a large number of code samples bearing the malicious prototype. Subsequently, MalCoder masks the original prototypes within the transplanted samples and fine-tunes an LLM code generator to reconstruct them. This process enables the model to seamlessly transplant malicious code fragments into benign code. During inference, MalCoder can automatically insert malicious fragments into benign samples at random positions, transforming benign code into malicious code. We apply MalCoder to a large pool of benign code in CodeSearchNet and craft over 50,000 malicious samples stemming from 39 malicious prototypes. Both qualitative and quantitative analyses show that the generated samples maintain key characteristics of malicious code while blending seamlessly with benign code, which helps in creating realistic and varied training data.
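The masking step described above can be illustrated with a generic span-masking sketch. This is only an assumption about how such (masked input, target) pairs might be built: the sentinel `MASK_TOKEN`, the `InfillExample` type, and the `make_infill_example` helper are hypothetical names, not taken from the paper, and the strings are harmless stand-ins rather than any real prototype.

```python
from dataclasses import dataclass

# Hypothetical sentinel; the abstract does not say which mask token is used.
MASK_TOKEN = "<MASK>"


@dataclass
class InfillExample:
    masked_source: str  # sample with the fragment replaced by MASK_TOKEN
    target: str         # the fragment the model is asked to reconstruct


def make_infill_example(sample: str, fragment: str) -> InfillExample:
    """Replace the first occurrence of `fragment` in `sample` with MASK_TOKEN."""
    start = sample.find(fragment)
    if start < 0:
        raise ValueError("fragment does not occur in sample")
    end = start + len(fragment)
    masked = sample[:start] + MASK_TOKEN + sample[end:]
    return InfillExample(masked_source=masked, target=fragment)


# Harmless stand-in strings, used only to show the data layout.
prefix = "def load(path):\n    data = open(path).read()\n"
fragment = "    placeholder_fragment(data)\n"
suffix = "    return data\n"
example = make_infill_example(prefix + fragment + suffix, fragment)
```

Substituting `example.target` back for `MASK_TOKEN` in `example.masked_source` recovers the original sample, which is the reconstruction objective the abstract refers to.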
Additionally, using the generated samples as augmented training data markedly improves malicious code detection: the F1-score increases significantly compared with training on only the original prototype samples.
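For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the helper name `f1_from_counts` and the counts in the usage line are invented for illustration and are not results from the paper.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall).

    tp: malicious samples correctly flagged
    fp: benign samples wrongly flagged
    fn: malicious samples missed
    """
    if tp == 0:
        # With no true positives, F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not from the paper): precision = recall = 0.8.
illustrative_f1 = f1_from_counts(tp=8, fp=2, fn=2)
```

Because F1 penalizes both missed detections and false alarms, it is a common choice when the malicious class is rare, as it is in the setting the abstract describes.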

External IDs: dblp:journals/tse/WangCGCS26