Keywords: Offline Reinforcement Learning, Active Learning
TL;DR: We devise a methodology that iteratively learns improving policies by combining offline RL (to avoid out-of-distribution state-actions), return-based sampling (to cope with diverse offline data), and active learning (to probe high-uncertainty state-actions).
Abstract: Offline Reinforcement Learning (RL) has emerged as a promising approach to address
real-world challenges where online interactions with the environment are limited, risky,
or costly. Although recent advances produce high-quality policies from offline data, there is currently no systematic methodology for continuing to improve them without resorting to online fine-tuning. This paper proposes to repurpose Offline RL to produce a sequence of improving policies, namely, Iterative Offline Reinforcement Learning (IORL). To produce such a sequence, IORL has to cope with imbalanced offline datasets and perform controlled environment exploration. Specifically, we introduce "Return-based Sampling" as a means to selectively prioritize experience from high-return trajectories, and active-learning-driven "Dataset Uncertainty Sampling" to probe state-actions with probability inversely proportional to their density in the dataset. We demonstrate that our proposed approach produces policies that achieve monotonically increasing average returns, from 65.4 to 140.2, in the Atari environment.
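Below is a minimal sketch of how the two samplers could be instantiated, assuming a softmax weighting over trajectory returns and a count-based proxy for dataset density; the function names, the temperature parameter, and the count-based density estimate are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def return_based_sampling_weights(trajectory_returns, temperature=1.0):
    # Softmax over trajectory returns: higher-return trajectories are
    # sampled more often when building the next offline training batch.
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    scaled = (returns - returns.max()) / temperature  # shift for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def density_inverse_sampling_weights(state_action_counts, eps=1e-6):
    # Probe state-actions in inverse proportion to how often they appear in
    # the offline dataset (counts used here as a stand-in for density).
    counts = np.asarray(state_action_counts, dtype=np.float64)
    weights = 1.0 / (counts + eps)
    return weights / weights.sum()

# Example: draw trajectory indices for the next iteration of offline training.
rng = np.random.default_rng(seed=0)
trajectory_returns = [12.0, 85.0, 140.0, 40.0]
probs = return_based_sampling_weights(trajectory_returns, temperature=20.0)
sampled_trajectories = rng.choice(len(trajectory_returns), size=8, p=probs)
```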
Submission Number: 47