Multi-Step Generalized Policy Improvement by Leveraging Approximate Models

Lucas Nunes Alegre; Ana L. C. Bazzan; Ann Nowe; Bruno Castro da Silva

Published: 21 Sept 2023, Last Modified: 02 Nov 2023, Venue: NeurIPS 2023 poster

Keywords: generalized policy improvement, successor features, transfer learning, model-based reinforcement learning

TL;DR: We introduce $h$-GPI, a multi-step extension of generalized policy improvement that interpolates between standard model-free GPI and fully model-based planning as a function of a parameter, $h$, regulating the amount of time the agent has to reason.

Abstract: We introduce a principled method for performing zero-shot transfer in reinforcement learning (RL) by exploiting approximate models of the environment. Zero-shot transfer in RL has been investigated by leveraging methods rooted in generalized policy improvement (GPI) and successor features (SFs). Although computationally efficient, these methods are model-free: they analyze a library of policies---each solving a particular task---and identify which action the agent should take. We investigate the more general setting where, in addition to a library of policies, the agent has access to an approximate environment model. Even though model-based RL algorithms can identify near-optimal policies, they are typically computationally intensive. We introduce $h$-GPI, a multi-step extension of GPI that interpolates between these extremes---standard model-free GPI and fully model-based planning---as a function of a parameter, $h$, regulating the amount of time the agent has to reason. We prove that $h$-GPI's performance lower bound is strictly better than GPI's, and show that $h$-GPI generally outperforms GPI as $h$ increases. Furthermore, we prove that as $h$ increases, $h$-GPI's performance becomes arbitrarily less susceptible to sub-optimality in the agent's policy library. Finally, we introduce novel bounds characterizing the gains achievable by $h$-GPI as a function of approximation errors in both the agent's policy library and its (possibly learned) model. These bounds strictly generalize those known in the literature. We evaluate $h$-GPI on challenging tabular and continuous-state problems under value function approximation and show that it consistently outperforms GPI and state-of-the-art competing methods under various levels of approximation errors.
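To make the interpolation between GPI and planning concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes one common reading of multi-step GPI: run Bellman optimality backups through the model for $h$ steps, then bootstrap with the best value any library policy promises. The names `gpi_action`, `h_gpi_action`, `P`, `R`, and `Q`, as well as the toy three-state MDP, are hypothetical and chosen only for illustration.

```python
import numpy as np

def gpi_action(q_library, s):
    """Standard GPI: maximize over library policies, then over actions."""
    # q_library has shape (n_policies, n_states, n_actions)
    return int(np.argmax(q_library[:, s, :].max(axis=0)))

def h_gpi_action(q_library, P, R, gamma, s, h):
    """Assumed h-step variant: plan h steps with the model, then
    bootstrap with the GPI value max_pi max_a q^pi(s', a)."""
    if h == 0:
        return gpi_action(q_library, s)  # no lookahead: plain GPI
    # Bootstrap value of each state under the policy library
    v = q_library.max(axis=(0, 2))  # shape (n_states,)
    # h - 1 Bellman optimality backups through the (approximate) model
    for _ in range(h - 1):
        v = (R + gamma * (P @ v)).max(axis=1)
    # Final backup yields h-step action values at the current state
    q_h = R[s] + gamma * (P[s] @ v)
    return int(np.argmax(q_h))

# Toy MDP (illustrative assumption): 3 states, 2 actions, state 2 absorbing.
P = np.zeros((3, 2, 3))  # P[s, a, s'] transition probabilities
P[0, 0, 1] = 1.0         # state 0, action 0 -> state 1
P[0, 1, 2] = 1.0         # state 0, action 1 -> state 2
P[1, :, 2] = 1.0         # state 1, any action -> state 2
P[2, :, 2] = 1.0         # state 2 loops forever
R = np.array([[0.0, 0.5], [0.0, 1.0], [0.0, 0.0]])  # R[s, a]

# Library with a single policy that always takes action 0 (sub-optimal).
Q = np.array([[[0.0, 0.5], [0.0, 1.0], [0.0, 0.0]]])

print("GPI action at state 0:", gpi_action(Q, 0))
print("h-GPI action at state 0 (h=1):", h_gpi_action(Q, P, R, 0.9, 0, 1))
```

In this toy problem the single library policy undervalues moving to state 1, so plain GPI greedily takes the immediate reward, while one step of lookahead through the model recovers the better action. The cost of the sketch grows with $h$ (one full backup over all states per extra step), which mirrors the computation-versus-quality trade-off the abstract describes.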

Supplementary Material: zip

Submission Number: 4577