Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies
Abstract: Reinforcement learning (RL) has shown great promise, with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous number of environment interactions to learn. This can be infeasible when such interactions are expensive, as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data, without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there is no single established protocol for evaluating offline RL methods. In this paper, we propose a sequential approach that evaluates offline RL algorithms as a function of the training set size, and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset, while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement. We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
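The sequential protocol summarized in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: it assumes the logged dataset is revealed to the learner in fixed-size chunks, that a fixed number of updates follows each chunk, and that performance is recorded against the amount of data seen so far. The names `sequential_eval`, `chunk_size`, `updates_per_chunk`, `update`, and `evaluate` are hypothetical, and any correspondence to the $K$ and $\gamma$ parameters mentioned in the change log is not confirmed by this page.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical transition layout for illustration: (state, action, reward).
Transition = Tuple[float, int, float]


def sequential_eval(
    dataset: Sequence[Transition],
    update: Callable[[List[Transition]], None],
    evaluate: Callable[[], float],
    chunk_size: int,
    updates_per_chunk: int,
    batch_size: int = 32,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Return (samples_seen, score) pairs as the training set grows.

    Assumption: data is consumed in its logged order, so changes in the
    data distribution over the log show up along the x-axis of the curve.
    """
    rng = random.Random(seed)
    buffer: List[Transition] = []
    curve: List[Tuple[int, float]] = []
    for start in range(0, len(dataset), chunk_size):
        # Reveal the next chunk of logged data to the learner.
        buffer.extend(dataset[start:start + chunk_size])
        for _ in range(updates_per_chunk):
            # Sample a mini-batch (with replacement) from the data seen so far.
            batch = rng.choices(buffer, k=min(batch_size, len(buffer)))
            update(batch)
        # Record performance against the current training set size.
        curve.append((len(buffer), evaluate()))
    return curve
```

Here `update` would wrap one gradient step of whichever offline RL algorithm is being compared, and `evaluate` would return its evaluation metric (for example, episodic return). Plotting the returned pairs gives a curve whose x-axis is the number of samples seen, which is the quantity that makes it comparable to an online learning curve.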
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Added V-D4RL results
- Added analysis of $K$ and $\gamma$, and updated the text to show the connection to replay ratio and online RL
- Updated the manuscript to highlight the justification for the given parameterisation of SeqEval
- Added results where off-policy metrics are used instead of episodic return as the evaluation metric
- Added a comparison of SeqEval with traditional mini-batch style training, as well as the trade-offs associated with each
- Fixed missing curves in the main experimental results
Assigned Action Editor: ~Manuel_Haussmann1
Submission Number: 1367