Keywords: Safe RL, State-Augmentation, Hamilton-Jacobi, Reinforcement Learning
TL;DR: Novel Bellman equations based on maximum value propagation for multi-objective tasks that require memory (state-augmentation)
Abstract: Hard constraints in reinforcement learning (RL), whether imposed via the reward function or the model architecture, often degrade policy performance.
Lagrangian methods offer a way to blend objectives with constraints, but they typically require intricate reward engineering and parameter tuning.
In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction.
Namely, we address: (1) the \textbf{Reach-Always-Avoid} (RAA) problem – of achieving distinct reward and penalty thresholds – and (2) the \textbf{Reach-Reach} (RR) problem – of achieving thresholds of two distinct rewards.
We derive explicit, tractable Bellman forms in this context by decomposing these problems.
The RAA and RR problems are fundamentally different from standard sum-of-rewards problems and temporal logic problems, providing a new perspective on constrained decision-making.
We leverage our analysis to propose a variation of Proximal Policy Optimization (\textbf{DO-HJ-PPO}), which solves these problems.
Across a range of safe-arrival and multi-target achievement tasks, we demonstrate that DO-HJ-PPO outperforms a variety of baselines.
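For intuition, here is a minimal sketch of the maximum-value-propagation idea referenced in the TL;DR, written for a generic undiscounted, deterministic reach problem; the symbols $r$, $V$, $f$, and $\mathcal{A}$ are illustrative and this is not the paper's RAA/RR formulation:
% Reach-type Bellman fixed point: the value propagates the best reward
% attainable along a trajectory, rather than a discounted sum of rewards.
\[
  V(s) \;=\; \max\Big\{\, r(s),\; \max_{a \in \mathcal{A}} V\big(f(s,a)\big) \Big\},
  \qquad
  V^{*}(s) \;=\; \max_{\pi}\, \max_{t \ge 0}\, r\big(s^{\pi}_{t}\big).
\]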
Supplementary Material: zip
Submission Number: 8