Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes

Abstract: We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.
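
To make the last step of the abstract concrete, the following is a minimal, hypothetical sketch of the general idea: a variational autoencoder with a discrete latent space is trained on transitions collected under a fixed RL policy, so that its encoder and learned latent transition matrix together define a small discrete latent model that can be handed to formal-methods tooling. All module names, dimensions, losses, and the toy data below are assumptions for illustration and are not the paper's actual architecture or training objective.

```python
# Illustrative sketch only (hypothetical architecture, not the paper's method):
# a discrete-latent VAE trained on (s, s') transitions gathered under a fixed policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_LATENT = 4, 16   # continuous state size; number of discrete latent states

class DiscreteLatentVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: continuous state -> logits over N_LATENT discrete latent states
        self.encoder = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                     nn.Linear(64, N_LATENT))
        # Decoder: (approximately) one-hot latent state -> reconstructed continuous state
        self.decoder = nn.Sequential(nn.Linear(N_LATENT, 64), nn.ReLU(),
                                     nn.Linear(64, STATE_DIM))
        # Logits of the latent transition matrix P(z' | z) induced by the fixed policy
        self.transition_logits = nn.Parameter(torch.zeros(N_LATENT, N_LATENT))

    def forward(self, s, s_next, tau=1.0):
        # Straight-through Gumbel-softmax yields (approximately) one-hot latent codes
        z = F.gumbel_softmax(self.encoder(s), tau=tau, hard=True)
        z_next = F.gumbel_softmax(self.encoder(s_next), tau=tau, hard=True)
        # Reconstruction term: decoding the latent code should recover the observed state
        recon_loss = F.mse_loss(self.decoder(z), s)
        # Transition term: the latent model should predict the successor's latent state
        pred_logits = z @ self.transition_logits          # row selected by the current z
        trans_loss = -(z_next * F.log_softmax(pred_logits, dim=-1)).sum(-1).mean()
        return recon_loss + trans_loss

# Toy stand-in for transitions (s, s') collected by rolling out the trained RL policy.
s = torch.randn(256, STATE_DIM)
s_next = s + 0.1 * torch.randn(256, STATE_DIM)

model = DiscreteLatentVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = model(s, s_next)
    loss.backward()
    opt.step()

# The encoder plus the row-stochastic matrix softmax(transition_logits) define a small
# discrete latent model; in the paper's setting, such a model (with the appropriate
# bisimulation bounds) is what formal methods for MDPs would be applied to.
```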