Formal-Lagrangian Policy Optimization for Safe Reinforcement Learning in Code Generation with Differentiable Verification

ICLR 2026 Conference Submission 25415 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Code Generation
Abstract: We propose Formal-Lagrangian Policy Optimization (FLPO), a framework for safe reinforcement learning (RL) in code generation that couples formal verification with policy optimization through a Lagrangian multiplier mechanism. The major bottleneck for RL-based code synthesis is enforcing hard safety constraints, such as memory safety or type correctness, without sacrificing the flexibility of generative models. FLPO addresses this by augmenting the reward function with a Lagrangian penalty on constraint violations, whose weight is adapted by dual ascent so that safety violations are driven downwards. Moreover, we propose a differentiable formal verification layer that relaxes verification results into a continuous signal, so that the policy network can learn directly from formal feedback.
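As a minimal sketch of the mechanism described in the abstract (our notation, not necessarily the authors': $R$ is the task reward, $C$ the formal-verification violation cost, $d$ the allowed violation budget, and $\eta$ the dual step size), the Lagrangian penalty and dual-ascent update can be written as

$$\max_{\theta} \; \min_{\lambda \ge 0} \;\; \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] \;-\; \lambda \Big(\mathbb{E}_{\tau \sim \pi_\theta}\big[C(\tau)\big] - d\Big), \qquad \lambda \;\leftarrow\; \max\!\Big(0,\; \lambda + \eta\big(\mathbb{E}_{\tau \sim \pi_\theta}\big[C(\tau)\big] - d\big)\Big),$$

so that $\lambda$ grows while the expected violation exceeds the budget and shrinks once the constraint is satisfied.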
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 25415