Benchmarking Partial Observability in Reinforcement Learning with a Suite of Memory-Improvable Domains
Mitigating partial observability is a necessary but challenging task for general reinforcement learning algorithms. To improve an algorithm's ability to mitigate partial observability, researchers need comprehensive benchmarks to gauge progress. Most algorithms tackling partial observability are evaluated only on benchmarks with simple forms of state aliasing, such as feature masking and Gaussian noise. These existing benchmarks do not represent the many forms of partial observability seen in real domains, such as visual occlusion and unknown opponent intent. We argue that a partially observable benchmark should have two key properties. The first is coverage in its forms of partial observability, to ensure an algorithm's generalizability. The second is a large gap between the performance of a memoryless agent and an agent with more state information. A large gap indicates that an environment is memory improvable: performance gains in such a domain come from an algorithm's ability to learn memory that mitigates partial observability, rather than from other factors. We introduce best-practice experimental guidelines for benchmarking reinforcement learning under partial observability, as well as the open-source library POBAX: Partially Observable Benchmarks in JAX. We characterize the types of partial observability present in various environments and select representative environments for our benchmark. These environments include localization and mapping, visual control, and games, among others. All of these tasks are memory improvable and require hard-to-learn memory functions, providing a concrete signal for partial observability research. The framework includes recommended hyperparameters for out-of-the-box evaluation, as well as highly performant environments implemented in JAX for GPU-scalable experimentation.
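One way to make the memory-improvability gap concrete is as a difference in achievable return between policy classes. The notation below ($J$, $\Pi_0$, $\Pi_{\mathrm{mem}}$, $\Delta_{\mathrm{mem}}$) is an illustrative sketch, not necessarily the paper's own formalization:

```latex
% Sketch of a memory-improvability gap; notation is illustrative.
\[
  \Delta_{\mathrm{mem}}
  \;=\;
  \max_{\pi \in \Pi_{\mathrm{mem}}} J(\pi)
  \;-\;
  \max_{\pi \in \Pi_{0}} J(\pi),
\]
where $\Pi_{0}$ is the class of memoryless (reactive) policies mapping
observations to actions, $\Pi_{\mathrm{mem}}$ is a class of policies with
access to memory (e.g., recurrent state or ground-truth state features),
and $J(\pi)$ is the expected discounted return of policy $\pi$. An
environment is memory improvable when $\Delta_{\mathrm{mem}}$ is large.
```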
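The GPU scalability claim rests on a standard JAX pattern: environments written as pure step functions can be batched with `jax.vmap` and compiled with `jax.jit`. The sketch below uses a toy aliased environment and generic names; it illustrates the pattern only and is not POBAX's actual API:

```python
import jax
import jax.numpy as jnp

# Toy partially observable environment: the agent moves on a line but only
# observes the sign of its position (state aliasing). Names are illustrative.

def env_step(state, action):
    """Pure step function: (state, action) -> (next_state, obs, reward)."""
    next_state = state + jnp.where(action == 1, 1.0, -1.0)
    obs = jnp.sign(next_state)      # aliased observation of the true state
    reward = -jnp.abs(next_state)   # reward for staying near the origin
    return next_state, obs, reward

def rollout(key, length=100):
    """Roll out a random policy with lax.scan (compiles to one kernel)."""
    def step(state, step_key):
        action = jax.random.bernoulli(step_key, 0.5).astype(jnp.int32)
        next_state, obs, reward = env_step(state, action)
        return next_state, (obs, reward)

    keys = jax.random.split(key, length)
    _, (_, rewards) = jax.lax.scan(step, 0.0, keys)
    return rewards.sum()

# vmap batches thousands of independent rollouts; jit compiles for GPU/TPU.
batched_rollout = jax.jit(jax.vmap(rollout))
returns = batched_rollout(jax.random.split(jax.random.PRNGKey(0), 4096))
print(returns.shape)  # (4096,)
```

Because both the environment step and the rollout are pure functions of their inputs, the whole batch of episodes runs as a single compiled program, which is what makes experimentation at this scale practical on accelerators.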