A General Theory of Relativity in Reinforcement Learning

Published: 28 Jan 2022, Last Modified: 13 Feb 2023
ICLR 2022 Submitted
Readers: Everyone
Keywords: Reinforcement Learning, General RL Theory, Policy Transfer, Dynamics Modeling
Abstract: We propose a new general theory that measures the relativity between two arbitrary Markov Decision Processes (MDPs) from the perspective of reinforcement learning (RL). Given two MDPs, tasks such as policy transfer, dynamics modeling, environment design, and simulation-to-reality (sim2real) are all closely related. The proposed theory investigates the connection between any two cumulative expected returns defined on different policies and environment dynamics, and the theoretical results suggest two new general algorithms, referred to as Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modeling, respectively. RPO updates the policy using the \emph{relative policy gradient} to transfer a policy evaluated in one environment so as to maximize the return in another, while RTO updates the parameterized dynamics model (if one exists) using the \emph{relative transition gradient} to reduce the gap between the dynamics of the two environments. Integrating the two algorithms yields the complete algorithm, Relative Policy-Transition Optimization (RPTO), in which the policy interacts with the two environments simultaneously, so that data collection from the two environments, policy updates, and transition updates are all completed in a closed loop, forming a principled learning framework for policy transfer. We demonstrate the effectiveness of RPO, RTO, and RPTO on OpenAI Gym's classic control tasks by constructing policy transfer problems.
One-sentence Summary: A new general RL theory for policy transfer
Supplementary Material: zip
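
Below is a minimal, illustrative sketch of the RPTO closed loop described in the abstract above. The paper's \emph{relative policy gradient} and \emph{relative transition gradient} are not specified on this page, so a plain REINFORCE surrogate and a supervised next-state loss are used as stand-ins; the environment pair (a CartPole variant with a perturbed pole length), network sizes, and hyper-parameters are assumptions for illustration only, and the gym >= 0.26 step/reset API is assumed.

```python
# Hedged sketch of the RPTO closed loop: the policy interacts with two
# environments, and policy updates and dynamics-model updates happen in one loop.
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_envs():
    # Source environment A and a target environment B with perturbed dynamics
    # (longer pole): a simple classic-control policy-transfer problem.
    env_a = gym.make("CartPole-v1")
    env_b = gym.make("CartPole-v1")
    env_b.unwrapped.length *= 2.0
    return env_a, env_b


class Policy(nn.Module):
    def __init__(self, obs_dim, n_act):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_act))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))


class DynamicsModel(nn.Module):
    # Parameterized transition model; RTO-style updates push it toward the
    # target environment's dynamics.
    def __init__(self, obs_dim, n_act):
        super().__init__()
        self.n_act = n_act
        self.net = nn.Sequential(nn.Linear(obs_dim + n_act, 64), nn.Tanh(), nn.Linear(64, obs_dim))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, F.one_hot(act, self.n_act).float()], dim=-1))


def rollout(env, policy):
    # Collect one episode of (obs, action, reward, next_obs, log-prob) tuples.
    obs, _ = env.reset()
    traj, done = [], False
    while not done:
        o = torch.as_tensor(obs, dtype=torch.float32)
        dist = policy(o)
        act = dist.sample()
        next_obs, rew, term, trunc, _ = env.step(act.item())
        traj.append((o, act, float(rew),
                     torch.as_tensor(next_obs, dtype=torch.float32), dist.log_prob(act)))
        obs, done = next_obs, term or trunc
    return traj


def discounted_returns(traj, gamma):
    g, out = 0.0, []
    for _, _, rew, _, _ in reversed(traj):
        g = rew + gamma * g
        out.append(g)
    return list(reversed(out))


def rpto(iters=200, gamma=0.99):
    env_a, env_b = make_envs()
    obs_dim, n_act = env_a.observation_space.shape[0], env_a.action_space.n
    policy, model = Policy(obs_dim, n_act), DynamicsModel(obs_dim, n_act)
    pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-3)
    md_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(iters):
        # Data collection from both environments in the same iteration.
        traj_a, traj_b = rollout(env_a, policy), rollout(env_b, policy)

        # RPO-like step (stand-in): policy gradient over data from both environments.
        batch = traj_a + traj_b
        rets = torch.as_tensor(discounted_returns(traj_a, gamma)
                               + discounted_returns(traj_b, gamma), dtype=torch.float32)
        rets = (rets - rets.mean()) / (rets.std() + 1e-8)
        logps = torch.stack([t[4] for t in batch])
        pi_loss = -(logps * rets).sum()
        pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

        # RTO-like step (stand-in): fit the transition model to target-env data,
        # shrinking the gap between modeled and target dynamics.
        obs = torch.stack([t[0] for t in traj_b])
        act = torch.stack([t[1] for t in traj_b])
        nxt = torch.stack([t[3] for t in traj_b])
        md_loss = F.mse_loss(model(obs, act), nxt)
        md_opt.zero_grad(); md_loss.backward(); md_opt.step()

    return policy, model


if __name__ == "__main__":
    rpto(iters=5)  # a few iterations just to exercise the loop
```

In the actual RPTO algorithm the two updates are coupled through the relative gradients rather than the generic surrogates used here; this sketch only illustrates the closed-loop structure in which both environments are sampled and both the policy and the transition model are updated each iteration.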
