Extreme Value Policy Optimization for Safe Reinforcement Learning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Ensuring safety is a critical challenge in applying Reinforcement Learning (RL) to real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the learning signal from rare but high-impact extreme samples. Theoretically, we establish upper bounds on expected constraint violations during policy updates, guaranteeing strict constraint satisfaction at a zero-violation quantile level. Further, we demonstrate that EVO achieves a lower probability of constraint violations than expectation-based methods and exhibits lower variance than quantile regression methods. Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines.
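The following is a minimal illustrative sketch, not the authors' implementation (see the linked repository for that): it shows the general peaks-over-threshold idea from Extreme Value Theory that the abstract alludes to, i.e., fitting a Generalized Pareto Distribution to the tail of observed costs to estimate an extreme cost quantile, and a hypothetical prioritization rule that up-weights rare high-cost samples during replay. Function names such as `extreme_quantile` and `replay_priorities`, and all thresholds and constants, are illustrative assumptions.

```python
"""Sketch: EVT-style extreme cost quantile estimation and replay weighting.

Assumptions: peaks-over-threshold (POT) tail modeling with a Generalized
Pareto Distribution (GPD); the prioritization scheme below is hypothetical
and not taken from the EVO paper or codebase.
"""
import numpy as np
from scipy.stats import genpareto


def extreme_quantile(costs, threshold_q=0.90, target_q=0.99):
    """Estimate the `target_q` cost quantile by fitting a GPD to
    exceedances over a high empirical threshold (POT method)."""
    costs = np.asarray(costs, dtype=float)
    u = np.quantile(costs, threshold_q)          # POT threshold
    excess = costs[costs > u] - u                # exceedances over u
    if len(excess) < 10:                         # too few tail samples:
        return np.quantile(costs, target_q)      # fall back to empirical quantile
    # Fit GPD shape (xi) and scale (sigma) to the exceedances, location fixed at 0.
    xi, _, sigma = genpareto.fit(excess, floc=0.0)
    # Conditional tail probability corresponding to the target quantile level.
    p_cond = 1.0 - (1.0 - target_q) * len(costs) / len(excess)
    p_cond = np.clip(p_cond, 0.0, 1.0 - 1e-6)
    return u + genpareto.ppf(p_cond, xi, loc=0.0, scale=sigma)


def replay_priorities(costs, q_hat, base=1.0, boost=5.0):
    """Hypothetical scheme: give samples whose cost exceeds the estimated
    extreme quantile a larger replay sampling probability."""
    costs = np.asarray(costs, dtype=float)
    prio = np.full_like(costs, base)
    prio[costs > q_hat] = boost
    return prio / prio.sum()                     # normalized sampling distribution


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Heavy-tailed synthetic episode costs to mimic rare extreme events.
    costs = rng.pareto(3.0, size=5000) + 0.1 * rng.standard_normal(5000)
    q_hat = extreme_quantile(costs)
    probs = replay_priorities(costs, q_hat)
    print(f"estimated 99% cost quantile: {q_hat:.3f}")
    print(f"fraction of boosted samples: {(probs > probs.min()).mean():.3f}")
```

Under these assumptions, transitions in the extreme cost tail are sampled more often, which is one simple way to amplify the learning signal from rare but high-impact events as described in the abstract.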
Lay Summary: When we teach AI systems to make decisions for real-world tasks, keeping them safe is a huge challenge, especially because rare but serious mistakes can still happen. Traditional methods often miss these rare "black swan" events, focusing mostly on the average case. To solve this, we propose a new technique called EVO. EVO pays special attention to these rare but risky situations by using a statistical tool called Extreme Value Theory, which is designed to spot and learn from unusual events. This way, our AI can better recognize and avoid the kinds of extreme mistakes that could cause real harm. In our tests, EVO made learning much safer: it greatly reduced the number of serious safety violations, all while keeping the AI just as effective at its job. This helps bring us closer to using AI confidently in places where safety is critical, like large language models, robotics, or autonomous driving.
Link To Code: https://github.com/ShiqingGao/EVO
Primary Area: Reinforcement Learning
Keywords: constrained RL, extreme value theory, constraint satisfaction
Submission Number: 11409