Abstract: This paper considers the problem of solving constrained reinforcement learning problems with
anytime guarantees, meaning that the algorithmic solution returns a safe policy regardless of when
it is terminated. Drawing inspiration from anytime constrained optimization, we introduce Reinforcement
Learning-based Safe Gradient Flow (RL-SGF), an on-policy algorithm which employs
estimates of the value functions and their respective gradients associated with the objective and
safety constraints for the current policy, and updates the policy parameters by solving a convex
quadratically constrained quadratic program. We show that if the estimates are computed with a
sufficiently large number of episodes (for which we provide an explicit bound), safe policies are
updated to safe policies with a probability higher than a prescribed tolerance. We also show that
iterates asymptotically converge to a neighborhood of a KKT point, whose size can be arbitrarily
reduced by refining the estimates of the value functions and their gradients. We illustrate the
performance of RL-SGF in a navigation example.
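To make the update step described above concrete, the following is a minimal illustrative sketch (not the paper's implementation) of a safe-gradient-flow-style policy update: the search direction tracks the estimated objective gradient subject to a linearized decrease condition on the estimated safety constraints and a quadratic trust-region bound, yielding a convex QCQP. All names and parameter choices (alpha, delta, eta, the trust-region constraint) are assumptions for illustration and may differ from the exact program solved by RL-SGF.

```python
# Hypothetical sketch of a safe-gradient-flow-style policy update (not the
# paper's exact QCQP); alpha, delta, eta and the trust-region term are assumed.
import numpy as np
import cvxpy as cp

def sgf_update(theta, grad_obj, grad_con, con_val, alpha=1.0, delta=0.1, eta=0.01):
    """One policy-parameter update from estimated value functions/gradients.

    theta    : current policy parameters, shape (d,)
    grad_obj : estimated gradient of the objective value function, shape (d,)
    grad_con : estimated gradients of the constraint value functions, shape (m, d)
    con_val  : estimated constraint values (feasible when <= 0), shape (m,)
    """
    direction = cp.Variable(theta.shape[0])
    # Stay close to the estimated ascent direction of the objective ...
    objective = cp.Minimize(cp.sum_squares(direction - grad_obj))
    constraints = [
        # ... while enforcing decrease of the (estimated) constraint values,
        grad_con @ direction <= -alpha * con_val,
        # ... inside a quadratic trust region, which makes the program a QCQP.
        cp.sum_squares(direction) <= delta**2,
    ]
    cp.Problem(objective, constraints).solve()
    return theta + eta * direction.value

# Example usage with synthetic estimates for a 5-dimensional policy.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
theta_new = sgf_update(theta, rng.normal(size=5),
                       rng.normal(size=(2, 5)), np.array([-0.2, -0.1]))
```

Note that when the current policy is safe (all estimated constraint values are nonpositive), the zero direction is feasible for this program, so the update is well defined at every iteration.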