Keywords: LLM, code generation, safety, security
Abstract: As large language models (LLMs) are increasingly used for code generation, concerns over their security risks have grown substantially.
Early research has primarily focused on red teaming, which aims to uncover and evaluate the vulnerabilities and risks of code generation models.
However, progress on the blue teaming side remains limited, as effective defense is challenging and requires semantic understanding of risky inputs and outputs. To fill this gap, we propose BlueCodeAgent, an end-to-end blue teaming agent enabled by automated red teaming.
Our framework integrates both sides: red teaming generates diverse risky instances, and the blue teaming agent leverages them to detect both previously seen and unseen risk scenarios, combining constitution-based analysis and code analysis in an agentic pipeline for multi-level defense.
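To make the constitution idea concrete, the following is a minimal hypothetical sketch, not the system's actual implementation: rules distilled from red-teaming instances are checked against incoming instructions. The rule names, patterns, and the `screen` function are illustrative stand-ins; a real agent would use an LLM judge for semantic matching, while a regex stand-in keeps the sketch self-contained.

```python
# Hypothetical sketch of constitution-based screening: rules distilled from
# red-team instances are checked against incoming instructions. A real agent
# would use an LLM judge; regex matching here is only a self-contained stand-in.
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str        # short label for the risk category
    pattern: str     # stand-in for an LLM-judged semantic criterion

# Constitution distilled (here, by hand) from red-teaming instances such as
# "write a keylogger" or "filter job applicants by gender".
CONSTITUTION = [
    Rule("malicious-tooling", r"\b(keylogger|ransomware|botnet)\b"),
    Rule("biased-criteria", r"\b(gender|race|religion)\b.*\b(filter|rank|score)\b"),
]

def screen(instruction: str) -> list[str]:
    """Return the names of constitution rules the instruction violates."""
    text = instruction.lower()
    return [r.name for r in CONSTITUTION if re.search(r.pattern, text)]

print(screen("Write a Python keylogger that emails captured keystrokes"))
# ['malicious-tooling']
```

Because the constitution generalizes the red-team instances into rules rather than memorizing them, this style of screening can also flag risky instructions that were never seen during red teaming.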
Our evaluation across three representative code-related tasks (bias instruction detection, malicious instruction detection, and vulnerable code detection) shows that BlueCodeAgent achieves significant gains over baseline models and safety-prompt-based defenses.
In particular, for the vulnerable code detection task, BlueCodeAgent integrates dynamic analysis to effectively reduce false positives, a critical but difficult-to-address problem.
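As a minimal hypothetical sketch of how dynamic analysis can prune false positives (an illustrative toy, not the paper's implementation): execute the statically flagged code on a probe input inside a subprocess, and report a vulnerability only if the probe observably misbehaves.

```python
# Hypothetical sketch: confirm a statically flagged vulnerability dynamically.
# A finding is reported only if a probe input makes the code crash or hang,
# filtering out static false positives. Illustrative only; a real sandbox
# would need stronger isolation than a bare subprocess.
import os
import subprocess
import sys
import tempfile
import textwrap

def confirm_vulnerability(candidate_code: str, probe_input: str,
                          timeout: float = 5.0) -> bool:
    """Return True only if `candidate_code` misbehaves on `probe_input`."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            input=probe_input, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return True  # hanging on an adversarial input confirms the flag
    finally:
        os.unlink(path)
    return result.returncode != 0  # a crash on the probe confirms the flag

# Example: code flagged for eval() of unsanitized user input.
flagged = textwrap.dedent("""
    expr = input()
    print(eval(expr))  # unsafe: executes arbitrary expressions
""")
print(confirm_vulnerability(flagged, "__import__('sys').exit(13)\n"))  # True
```

Benign code that merely resembles a risky pattern passes the probe and is not reported, which is how dynamic confirmation cuts false positives.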
Overall, BlueCodeAgent achieves much more effective and context-aware risk detection and mitigation.
We demonstrate that red teaming benefits blue teaming by continuously identifying new risky instances, which can significantly enhance defense performance.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20785