Pure Exploration in Episodic Fixed-Horizon Markov Decision Processes

AAMAS 2017
Abstract: Multi-Armed Bandit (MAB) problems can be naturally extended to Markov Decision Processes (MDPs). We extend the Best Arm Identification problem to episodic fixed-horizon MDPs, where the goal of an agent interacting with the MDP is to identify the optimal policy with high confidence in as few episodes as possible. We propose Posterior Sampling for Pure Exploration (PSPE), a Bayesian algorithm for pure exploration in MDPs. We show empirically that PSPE achieves deep exploration: the number of episodes it requires to reach a fixed confidence value is exponentially lower than that of random exploration, and lower than that of reward-maximizing algorithms such as Posterior Sampling for Reinforcement Learning (PSRL).
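To make the posterior-sampling idea behind PSRL-style algorithms concrete, below is a minimal Python sketch of a sample-then-solve loop on a small tabular, episodic, fixed-horizon MDP. The environment, the Dirichlet/Beta conjugate priors, the Bernoulli rewards, and the final posterior-agreement check are all illustrative assumptions; they are not the paper's exact PSPE objective or stopping rule, which the abstract does not specify.

```python
# Hypothetical PSRL-style posterior-sampling loop on a small tabular,
# episodic, fixed-horizon MDP. All sizes, priors, and the confidence check
# are illustrative assumptions, not the paper's exact PSPE algorithm.
import numpy as np

S, A, H = 5, 2, 4                     # states, actions, horizon (assumed sizes)
rng = np.random.default_rng(0)

# A fixed "true" MDP, used only to simulate episodes.
true_P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel (S, A, S)
true_R = rng.uniform(size=(S, A))                 # mean Bernoulli rewards

# Conjugate posteriors: Dirichlet over transitions, Beta over Bernoulli rewards.
trans_counts = np.ones((S, A, S))
rew_alpha = np.ones((S, A))
rew_beta = np.ones((S, A))

def solve_fh(P, R):
    """Finite-horizon value iteration; returns the greedy policy per step."""
    policy = np.zeros((H, S), dtype=int)
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = R + P @ V                 # (S, A) action values at step h
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

def sample_mdp():
    """Draw one MDP from the current posterior."""
    P_hat = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                      for s in range(S)])
    R_hat = rng.beta(rew_alpha, rew_beta)
    return P_hat, R_hat

for episode in range(200):
    # 1. Sample an MDP from the posterior and solve it.
    policy = solve_fh(*sample_mdp())
    # 2. Run one episode with the resulting policy, updating the posterior.
    s = 0
    for h in range(H):
        a = policy[h, s]
        r = rng.binomial(1, true_R[s, a])
        s_next = rng.choice(S, p=true_P[s, a])
        trans_counts[s, a, s_next] += 1
        rew_alpha[s, a] += r
        rew_beta[s, a] += 1 - r
        s = s_next

# Illustrative confidence proxy: how often do fresh posterior samples agree
# on the greedy first-step action in the start state?
first_actions = [solve_fh(*sample_mdp())[0, 0] for _ in range(50)]
agreement = np.bincount(first_actions, minlength=A).max() / len(first_actions)
print(f"posterior agreement on the initial action: {agreement:.2f}")
```

In this sketch the loop is reward-seeking in the PSRL sense; a pure-exploration variant such as PSPE would instead select episodes to reduce posterior uncertainty about the optimal policy, and would stop once a target confidence is reached.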