Best Policy Identification in Discounted Linear MDPs

Published: 20 Jul 2023, Last Modified: 01 Sept 2023, EWRL16
Keywords: Discounted MDPs, Linear MDPs, Pure exploration, Track-and-stop, Forward model, Best-policy identification
Abstract: We consider the problem of best policy identification in discounted Linear Markov Decision Processes in the fixed-confidence setting, under both the generative and the forward model. We derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but it can serve as a starting point for devising simple and near-optimal sampling rules and algorithms, and we devise such algorithms for both models. Under the generative model, our algorithm has a sample complexity upper bounded by ${\cal O}( (d(1-\gamma)^{-4}/ (\varepsilon+\Delta)^2) (\log(1/\delta)+d))$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$) and matches existing minimax and gap-dependent lower bounds. Under the forward model, we identify a condition under which learning approximately optimal policies is possible; this condition is weak and requires the MDP to be neither ergodic nor communicating. Under this condition, the sample complexity of our algorithm is asymptotically (as $\delta$ approaches 0) upper bounded by ${\cal O}((\sigma^\star(1-\gamma)^{-4}/(\varepsilon+\Delta)^2) \log(1/\delta))$, where $\sigma^\star$ is an instance-specific constant given by the value of an optimal experiment-design problem. To derive this bound, we establish novel concentration results for random matrices built from Markovian data.
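To make the scaling of the generative-model bound concrete, the following minimal Python sketch evaluates its leading-order term, ${\cal O}( (d(1-\gamma)^{-4}/ (\varepsilon+\Delta)^2) (\log(1/\delta)+d))$, for a few illustrative parameter choices. The hidden constant is not specified in the abstract, and all parameter values below are hypothetical, so the numbers only reflect relative scaling, not absolute sample counts.

```python
import math

def generative_bound(d, gamma, eps, gap, delta):
    """Leading-order term of the generative-model sample-complexity bound:
    d / ((1 - gamma)^4 (eps + Delta)^2) * (log(1/delta) + d).
    The absolute constant is unspecified, so values are only meaningful
    relative to one another (scaling in d, gamma, eps, Delta, delta).
    """
    return d / ((1 - gamma) ** 4 * (eps + gap) ** 2) * (math.log(1 / delta) + d)

# Hypothetical parameters, purely for illustration of the scaling.
for gamma in (0.9, 0.99):
    for eps in (0.1, 0.01):
        n = generative_bound(d=10, gamma=gamma, eps=eps, gap=0.05, delta=1e-3)
        print(f"gamma={gamma}, eps={eps}: ~{n:.3e} samples (up to constants)")
```

As the sketch suggests, the bound grows by a factor of $10^4$ when the effective horizon $1/(1-\gamma)$ increases tenfold, and the accuracy term $\varepsilon$ stops mattering once it is dominated by the minimum gap $\Delta$.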