Abstract: The rise of code-generation large language models (LLMs) has revolutionized software development by significantly enhancing productivity. However, their reliance on extensive datasets collected from open-source repositories exposes them to backdoor attacks, wherein malicious actors inject poisoned data to manipulate the generated code. These attacks pose serious security risks by embedding vulnerable code snippets into software applications. Existing research primarily focuses on designing stealthy backdoor attacks, leaving a gap in effective defenses.
In this paper, we investigate trigger inversion as a defense mechanism for safeguarding code-generation LLMs. Trigger inversion aims to identify the adversary-defined input patterns (triggers) that activate malicious behavior in backdoored models. We study the effectiveness of two representative adversarial optimization-based inversion algorithms originally developed for general LLMs. Our experiments show that these methods can successfully recover triggers from backdoored code LLMs under specific settings. However, we also observe that inversion effectiveness is highly sensitive to factors such as suffix length and initialization, and that a lower optimization loss does not always correlate with successful trigger recovery. These findings highlight the limitations of existing approaches and underscore the urgent need for more robust and generalizable trigger inversion techniques tailored specifically to the code domain.
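For readers unfamiliar with adversarial optimization-based trigger inversion, the sketch below illustrates the general idea with a greedy coordinate-gradient (GCG-style) search: a short token suffix appended to a benign prompt is optimized so that the model's loss on a suspected malicious completion decreases. This is not the paper's exact algorithm; the model name, prompt, target string, suffix length, and all hyperparameters are illustrative placeholders.

```python
# Minimal GCG-style trigger-inversion sketch (illustrative only).
# Assumes a Hugging Face causal code LLM; all strings and settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"   # placeholder for a backdoored code LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
embed = model.get_input_embeddings()          # (vocab_size, hidden_dim)

prompt = "def load_config(path):"             # benign context
target = "\n    os.system(user_input)"        # suspected malicious payload
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]

suffix_len = 8                                # length of the candidate trigger
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0])  # naive initialization

def target_loss(cand_ids):
    """Cross-entropy of the target completion given prompt + candidate trigger."""
    with torch.no_grad():
        ids = torch.cat([prompt_ids, cand_ids, target_ids]).unsqueeze(0)
        labels = ids.clone()
        labels[:, : prompt_ids.numel() + cand_ids.numel()] = -100  # score target only
        return model(ids, labels=labels).loss

for step in range(50):
    # One-hot relaxation of the suffix so we can differentiate w.r.t. token choices.
    one_hot = torch.nn.functional.one_hot(suffix_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    suffix_emb = one_hot @ embed.weight
    full_emb = torch.cat([embed(prompt_ids), suffix_emb, embed(target_ids)]).unsqueeze(0)
    labels = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    labels = labels.clone()
    labels[:, : prompt_ids.numel() + suffix_len] = -100
    loss = model(inputs_embeds=full_emb, labels=labels).loss
    loss.backward()

    # Greedy coordinate step: try top-k token substitutions at one random position.
    pos = torch.randint(suffix_len, (1,)).item()
    candidates = (-one_hot.grad[pos]).topk(16).indices
    best, best_loss = suffix_ids, target_loss(suffix_ids)
    for c in candidates:
        trial = suffix_ids.clone()
        trial[pos] = c
        l = target_loss(trial)
        if l < best_loss:
            best, best_loss = trial, l
    suffix_ids = best
    print(f"step {step}: loss {best_loss.item():.4f}  trigger {tok.decode(suffix_ids)!r}")
```

As the abstract notes, the outcome of such a search can depend heavily on the suffix length and the initialization of `suffix_ids`, and a low final loss does not guarantee that the recovered suffix matches the true trigger.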
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy; safety and alignment; robustness; code models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study
Languages Studied: Python, English
Submission Number: 7580