Abstract: Existing research on CodeGen AI security mainly focuses on red teaming, which aims to uncover vulnerabilities and risks in AI-generated code. However, progress on the blue teaming side remains limited, as effective defenses require a deep security analysis of given tasks and edge cases. To fill this gap, we propose BlueCodeAgent, an end-to-end blue teaming agent powered by automated red teaming. Our red teaming component generates diverse risky instances, providing effective edge cases and guidance for the subsequent blue teaming process. Our blue teaming agent then conducts multi-level defense, leveraging these red teaming examples to detect both previously seen and unseen risk scenarios through constitution summarization and dynamic code analysis. Our evaluation across four representative code-related tasks (bias instruction detection, malicious instruction detection, vulnerable code detection, and prompt injection detection) shows that BlueCodeAgent achieves significant gains over diverse baselines. In particular, for vulnerability detection tasks, BlueCodeAgent integrates dynamic analysis to effectively reduce false positives, a challenging problem because base models tend to be over-conservative. Overall, with GPT-4o as the base model, BlueCodeAgent achieves an average F1 score improvement of 14.7% across the four tasks compared to directly prompting the model, which we attribute to its ability to summarize actionable constitutions and perform dynamic analysis. Our code and data are publicly available at https://github.com/1mocat/BlueCodeAgent.
Lay Summary: Code generation AI systems are increasingly used, but they can also produce unsafe or harmful code. Most existing research focuses on “red teaming”, which means finding vulnerabilities and risky behaviors in these systems. In contrast, much less work has studied how to actively defend against these risks.
We introduce BlueCodeAgent, an agentic defense system designed to improve the safety of code-generation models. Our method first uses automated red teaming to create diverse risky examples and edge cases. BlueCodeAgent then learns from these examples to recognize both known and previously unseen threats. It improves defense reliability by combining constitution summarization with dynamic code analysis.
We evaluate BlueCodeAgent on four important security-related tasks: detecting biased instructions, malicious requests, insecure code, and prompt injection attacks. Our results show that BlueCodeAgent substantially improves detection performance compared to standard prompting methods. In particular, dynamic analysis helps reduce false alarms, which are common in existing AI systems. Overall, our work demonstrates how combining automated attack generation with defense strategies can make AI coding assistants safer and more trustworthy for real-world software development.
Primary Area: Social Aspects->Safety
Keywords: LLM, Code generation, safety, security
Submission Number: 18406