Keywords: Reinforcement Learning, Transfer Learning, Exploration, Test-Time Adaptation
TL;DR: Identifying the common characteristics of conventional exploration algorithms and determining which characteristics are most suitable for transferring policies between different MDPs.
Abstract: In reinforcement learning (RL), exploration helps policy models learn to solve individual tasks more efficiently and in increasingly challenging environments. In many real-world applications of RL, however, environments are non-stationary: they can change in unanticipated and unanticipatable ways, and there are conditions in which the agent must adapt its policy online, at test time, to the changed environment. Because most exploration methods are designed for the stationary MDPs of single tasks, it is not well understood which exploration methods are most beneficial for efficient online task transfer. Our first contribution is to categorize an array of exploration methods according to common "characteristics," such as being designed around a separate exploration objective or adding noise to the RL process. We then evaluate eleven exploration algorithms, within and across characteristics, on the efficiency of adaptation and transfer in multiple discrete and continuous domains. Our results show that exploration methods designed around the principles of explicit diversity and stochasticity most consistently benefit policy transfer. Additionally, our analysis considers why some characteristics correlate with improved performance and efficiency across multiple tasks, while others improve transfer performance only on specific tasks. We conclude by discussing the implications for designing future exploration algorithms that adapt most efficiently to unexpected test-time environment changes.
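To make the distinction between the two characteristics named in the abstract concrete, here is a minimal tabular sketch contrasting noise-based exploration (epsilon-greedy action noise) with exploration driven by a separate objective (a count-based intrinsic bonus). The functions, variable names, and the specific bonus form are illustrative assumptions, not the paper's actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
q_values = np.zeros((n_states, n_actions))      # tabular Q-value estimates
visit_counts = np.zeros((n_states, n_actions))  # state-action visitation counts

def epsilon_greedy_action(state, epsilon=0.1):
    """Noise-based exploration: act greedily except with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values[state]))

def count_bonus_action(state, beta=0.5):
    """Objective-based exploration: add a count-based intrinsic bonus to Q."""
    bonus = beta / np.sqrt(visit_counts[state] + 1.0)
    return int(np.argmax(q_values[state] + bonus))
```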
Submission Number: 2