- Keywords: Reinforcement Learning, Experience Replay
- Abstract: Training sample selection techniques, such as prioritized experience replay (PER), are recognized as important for online reinforcement learning algorithms: efficient sample selection can improve both learning efficiency and final performance. However, the impact of sample selection on batch reinforcement learning algorithms, which aim to learn a near-optimal policy exclusively from an offline logged dataset, has not been well studied. In this work, we investigate the application of non-uniform sampling techniques in batch reinforcement learning. In particular, we compare six variants of PER based on heuristic priority metrics that focus on different aspects of the offline learning setting: temporal-difference error, n-step return, self-imitation learning objective, pseudo-count, uncertainty, and likelihood. Through extensive experiments on standard batch RL datasets, we find that non-uniform sampling is also effective in batch RL settings, and that no single metric works in all situations. Our findings also show that changing the sampling scheme alone is insufficient to avoid bootstrapping error in batch reinforcement learning.
- One-sentence Summary: Benchmarking non-uniform sampling strategies in batch reinforcement learning.
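
To make the idea of non-uniform sampling from a fixed offline dataset concrete, the following is a minimal sketch of generic proportional prioritization, not the implementation used in this work. It assumes each logged transition already has a scalar priority (for example, the absolute temporal-difference error). The function names, the exponents `alpha` and `beta`, and their default values are illustrative assumptions.

```python
import numpy as np

def per_probabilities(priorities, alpha=0.6, eps=1e-6):
    """Turn per-transition priorities into sampling probabilities.

    alpha = 0 recovers uniform sampling; larger alpha samples
    high-priority transitions more often. (Hypothetical helper.)
    """
    scaled = (np.abs(priorities) + eps) ** alpha
    return scaled / scaled.sum()

def sample_batch(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample dataset indices non-uniformly, with importance weights.

    The weights (N * P(i))^(-beta), normalized by their maximum, partly
    correct the bias introduced by non-uniform sampling.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = per_probabilities(priorities, alpha)
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights = weights / weights.max()
    return idx, weights

# Example: priorities stand in for |TD error| on a tiny offline dataset.
td_errors = np.array([0.1, 0.5, 2.0, 0.0])
indices, is_weights = sample_batch(td_errors, batch_size=8)
```

Swapping the priority array for a different metric (n-step return, pseudo-count, and so on) changes which transitions are emphasized while leaving the sampling routine unchanged.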