- Keywords: Reinforcement Learning, Efficient Exploration, Asymptotic Analysis, Statistical Inference
- TL;DR: We investigate the large-sample behaviors of the Q-value estimates and proposed an efficient exploration strategy that relies on estimating the relative discrepancies among the Q estimates.
- Abstract: Despite an ever growing literature on reinforcement learning algorithms and applications, much less is known about their statistical inference. In this paper, we investigate the large-sample behaviors of the Q-value estimates with closed-form characterizations of the asymptotic variances. This allows us to efficiently construct confidence regions for Q-value and optimal value functions, and to develop policies to minimize their estimation errors. This also leads to a policy exploration strategy that relies on estimating the relative discrepancies among the Q estimates. Numerical experiments show superior performances of our exploration strategy than other benchmark approaches.