Abstract: We study query- and computationally-efficient planning algorithms for discounted Markov decision processes (MDPs) with linear function approximation and a simulator. The agent is assumed
to have local access to the simulator, meaning that the simulator can be queried only at states that
have been encountered in previous steps. We propose two new algorithms for this setting, which we
call confident Monte Carlo least-squares policy iteration (CONFIDENT MC-LSPI) and confident
Monte Carlo Politex (CONFIDENT MC-POLITEX), respectively. The main novelty in our algorithms is that they gradually build a set of state-action pairs (“core set”) with which they can control
the extrapolation errors. We show that our algorithms have polynomial query and computational
cost in the dimension of the features, the effective planning horizon and the targeted sub-optimality,
while the cost remains independent of the size of the state space. An interesting technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy
iteration algorithm. We use this method to leverage existing results on approximate policy iteration
with ℓ∞-bounded error to show that our algorithm can learn the optimal policy for the given initial
state even with only local access to the simulator. We believe that this technique can be extended
to broader settings beyond this work.
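
To make the core-set idea concrete, below is a minimal, illustrative sketch (not the authors' exact procedure) of how a state-action pair might be added to the core set only when its feature vector is poorly covered by the pairs already collected, which is what keeps extrapolation errors under control. The function name and the parameters `lam` and `tau` are hypothetical choices for regularization and the coverage threshold.

```python
import numpy as np

def maybe_add_to_core_set(phi_sa, core_features, lam=1e-3, tau=1.0):
    """Illustrative core-set test: add the candidate feature vector phi_sa
    to the core set if it is not well covered by the existing core features.

    phi_sa:        (d,) feature vector of the candidate state-action pair
    core_features: list of (d,) feature vectors already in the core set
    lam, tau:      hypothetical regularization and coverage thresholds
    """
    d = phi_sa.shape[0]
    if core_features:
        Phi = np.stack(core_features)            # (n, d) design matrix
        cov = Phi.T @ Phi + lam * np.eye(d)      # regularized covariance
    else:
        cov = lam * np.eye(d)
    # Coverage score: how poorly phi_sa is explained by the current core set.
    score = phi_sa @ np.linalg.solve(cov, phi_sa)
    if score > tau:
        core_features.append(phi_sa)             # extrapolation would be large: keep it
        return True
    return False
```

In this sketch, the planner would query the simulator (locally, at previously encountered states), test each encountered state-action pair with the coverage score, and grow the core set only when the score exceeds the threshold; since the score is bounded in terms of the feature dimension, the core set stays polynomially small, independent of the size of the state space.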