Adversarial Robust Reward Shaping for Safe Reinforcement Learning in AI-Generated Code

ICLR 2026 Conference Submission 25368 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Adversarial Robust Reward
Abstract: We propose \textbf{Adversarial Robust Reward Shaping (ARRS)}, a novel reinforcement learning framework for secure code generation that explicitly addresses vulnerability to adversarial evasion attacks. Conventional reward functions for code generation rarely account for how vulnerable detection mechanisms are to subtle syntactic perturbations, which leads to brittle security guarantees. The proposed method integrates an \textbf{Adversarial Robustness Module (ARM)} into the reward computation pipeline; the module systematically identifies worst-case failure scenarios through gradient-based perturbation analysis and penalizes the policy for generating exploitable code patterns. ARM generates semantics-preserving adversarial examples that maximally degrade the code evaluation system, then adds a robustness penalty to the reward signal so that the RL agent learns to produce code that is intrinsically secure.
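To make the shaped reward concrete, the sketch below shows one way a robustness penalty of this kind could enter the reward computation. It is a minimal, hypothetical illustration: the names (`detector_score`, `semantics_preserving_perturbations`, `shaped_reward`, `lam`), the toy detector, and the fixed rewrites are our assumptions rather than the authors' implementation, and a real ARM would search for perturbations with gradient guidance instead of applying hand-written edits.

```python
# Hypothetical sketch of an ARRS-style shaped reward; all names are
# illustrative assumptions, not the paper's API. A real ARM would use
# gradient-based perturbation analysis rather than fixed rewrites.

def detector_score(code: str) -> float:
    """Toy vulnerability detector: confidence in [0, 1] that `code` is secure.

    Deliberately brittle: it flags `eval(` but misses the equivalent `eval (`.
    """
    return 0.0 if "eval(" in code else 1.0


def semantics_preserving_perturbations(code: str) -> list[str]:
    """Rewrites that keep program behavior but may change the detector's verdict."""
    return [
        code.replace("eval(", "eval ("),  # whitespace before the call parens
        code.replace("    ", "\t"),       # re-indentation
    ]


def shaped_reward(code: str, base_reward: float, lam: float = 0.5) -> float:
    """Base reward minus a robustness penalty: the worst-case shift in the
    detector's verdict across semantics-preserving adversarial variants."""
    clean = detector_score(code)
    worst_gap = max(
        (abs(clean - detector_score(p))
         for p in semantics_preserving_perturbations(code)),
        default=0.0,
    )
    return base_reward - lam * worst_gap
```

Under these assumptions, `shaped_reward("y = eval(x)", base_reward=1.0)` returns 0.5: the detector's verdict flips under a trivial whitespace rewrite, so the penalty discourages the policy from relying on a security signal that adversarial perturbations can evade.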
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 25368