Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Abstract: Though deep reinforcement learning (DRL) has achieved substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty caused by stochasticity in both environments and policies. Existing safe reinforcement learning methods often transform the optimization criterion and adopt the variance of the return as a measure of uncertainty. However, the return variance introduces a bias by penalizing positive and negative deviations equally, which departs from the goal of safe reinforcement learning to penalize only negative risk. To address this issue, we propose to use the conditional value-at-risk (CVaR) as the risk measure, which guarantees that the probability of reaching a catastrophic state stays below a desired threshold. Furthermore, we present CVaR-Proximal-Policy-Optimization (CPPO), a novel reinforcement learning framework that formalizes the risk-sensitive constrained optimization problem by keeping the CVaR of the return under a given threshold. To evaluate the robustness of policies, we theoretically prove that performance degradation under observation disturbance and transition disturbance depends on the gap in the value function between the best and worst states. We also show that CPPO can generate more robust policies under disturbance. Experimental results show that CPPO achieves higher cumulative reward and exhibits stronger robustness against observation and transition disturbances on a series of continuous control tasks in MuJoCo.
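For concreteness, the tail-risk quantity that the abstract describes can be estimated empirically from sampled episode returns. The sketch below is a minimal illustration of a lower-tail CVaR estimate and a constraint check, not the authors' CPPO implementation; the function name `empirical_cvar`, the confidence level `alpha`, and the threshold value are assumptions introduced for this example.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Empirical CVaR of the lower tail of sampled returns.

    alpha is the tail probability: CVaR is the mean of the worst
    alpha-fraction of returns (those at or below the alpha-quantile).
    """
    returns = np.asarray(returns, dtype=np.float64)
    var = np.quantile(returns, alpha)      # value-at-risk: the alpha-quantile
    tail = returns[returns <= var]         # worst-case tail of the return distribution
    return tail.mean()

# Example: check a CVaR constraint on a batch of Monte Carlo returns.
rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=100.0, scale=20.0, size=10_000)
threshold = 50.0                           # illustrative safety threshold (assumption)
cvar = empirical_cvar(sampled_returns, alpha=0.05)
print(f"CVaR_0.05 = {cvar:.2f}, constraint satisfied: {cvar >= threshold}")
```

In a constrained policy-optimization loop of the kind the abstract outlines, an estimate like this would typically enter the objective as a constraint or penalty term whenever it falls below the allowed threshold.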