Abstract: We consider constrained policy optimization in reinforcement learning, where the constraints take the form of marginals on state visitations and global action executions. Given these distributions, we formulate policy optimization as unbalanced optimal transport over the space of occupancy measures. We propose a general-purpose RL objective based on Bregman divergence and optimize it using Dykstra's algorithm when the transition model is known. The approach admits an actor-critic algorithm for settings where the state or action space is large and only samples from the marginals are available. We discuss applications of our approach and provide demonstrations showing the effectiveness of our algorithm.
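Since Dykstra's algorithm with a Bregman divergence is the computational core in the known-model case, a minimal sketch may be helpful. This is not the paper's implementation: it assumes KL as the Bregman divergence and exact row/column marginal constraints, for which Dykstra's correction terms vanish on the affine constraint sets and the alternating KL projections reduce to Sinkhorn-like scaling. The names `mu` and `nu` are illustrative stand-ins for the target state-visitation and action-execution marginals.

```python
import numpy as np

def dykstra_kl(K, mu, nu, n_iters=200):
    """Project the positive matrix K onto {P : P @ 1 = mu, P.T @ 1 = nu}
    in KL divergence by alternating Bregman (KL) projections.

    With exact (affine) marginal constraints each projection has a closed
    form, so the loop is just coordinate-wise scaling."""
    u = np.ones_like(mu)  # dual scaling for the row-marginal constraint
    v = np.ones_like(nu)  # dual scaling for the column-marginal constraint
    for _ in range(n_iters):
        u = mu / (K @ v)      # KL projection onto {P : P @ 1 = mu}
        v = nu / (K.T @ u)    # KL projection onto {P : P.T @ 1 = nu}
    return u[:, None] * K * v[None, :]

# Toy usage: couple a 3-state visitation marginal with a 2-action marginal.
rng = np.random.default_rng(0)
K = np.exp(-rng.random((3, 2)))  # Gibbs kernel from a random cost matrix
P = dykstra_kl(K, np.array([0.5, 0.3, 0.2]), np.array([0.6, 0.4]))
print(P.sum(axis=1), P.sum(axis=0))  # approximately mu and nu
```

Handling unbalanced (soft) marginal penalties or additional occupancy-measure constraints, as in the paper, would replace these closed-form projections with the corresponding proximal steps and require the full Dykstra corrections.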
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- added "when the transition model is known" to the abstract and Sec. 3 to address the reviewer's concern about Dykstra's algorithm
- added a comment after Algorithm 1 regarding the double sampling for Eq. 23
Assigned Action Editor: ~Bo_Dai1
Submission Number: 305