Keywords: procedurally generated environments, curriculum learning, meta-learning, Procgen benchmark
TL;DR: Dispersion of returns can be used as an alternative to TD errors for scoring the future learning potential of procedurally generated levels.
Abstract: Prioritized Level Replay (PLR) has been shown to induce adaptive curricula that improve the sample efficiency and generalization of reinforcement learning policies in environments featuring multiple tasks or levels. PLR selectively samples training levels, weighted by a function of the recent temporal-difference (TD) errors experienced on each level. We explore the dispersion of returns as an alternative prioritization criterion to address certain issues with TD error scores.
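To make the idea concrete, the following is a minimal sketch of level scoring by return dispersion, not the authors' implementation. It assumes that dispersion is measured as the population standard deviation of recent episode returns per level, and that scores are turned into sampling probabilities by a rank-based weighting with a temperature parameter. The function names (`dispersion_scores`, `rank_weights`, `sample_level`), the level identifiers, and the return values are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def dispersion_scores(returns_by_level):
    """Score each level by the spread of its recent episode returns.

    Assumption: population standard deviation stands in for "dispersion";
    the abstract does not fix a specific measure.
    """
    return {
        level: statistics.pstdev(rets) if len(rets) > 1 else 0.0
        for level, rets in returns_by_level.items()
    }


def rank_weights(scores, temperature=1.0):
    """Convert scores to sampling probabilities using rank prioritization.

    Assumption: weight is (1 / rank) ** (1 / temperature), normalized to sum to 1.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    raw = {
        level: (1.0 / (rank + 1)) ** (1.0 / temperature)
        for rank, level in enumerate(ordered)
    }
    total = sum(raw.values())
    return {level: weight / total for level, weight in raw.items()}


def sample_level(weights, rng=random):
    """Draw one level identifier according to the given probabilities."""
    levels = list(weights)
    return rng.choices(levels, weights=[weights[l] for l in levels], k=1)[0]


# Hypothetical recent returns for three levels.
returns_by_level = {
    "level_a": [1.0, 1.0, 1.0],   # no spread: nothing left to learn, or stuck
    "level_b": [0.0, 10.0, 5.0],  # high spread: outcomes still vary a lot
    "level_c": [2.0, 3.0, 2.5],   # small spread
}

scores = dispersion_scores(returns_by_level)
weights = rank_weights(scores)
next_level = sample_level(weights)
```

In this sketch, levels whose returns vary the most across recent episodes are replayed most often, which mirrors the role that TD-error scores play in PLR while requiring only episode returns.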