Abstract: We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where
each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and analyze its regret in continuing RL problems. We assume that the value function error after running a policy for τ time steps scales as ε(τ) = ε0 + O(sqrt(d/τ)), where ε0 is the
worst-case approximation error and d is the number of features in a compressed representation of
the state-action space. We establish that this condition is satisfied by the LSPE algorithm under
certain assumptions on the MDP and policies. Under the error assumption, we show that the regret
of POLITEX in uniformly mixing MDPs scales as O(d^{1/2} T^{3/4} + ε0 T), where O(·) hides logarithmic terms and problem-dependent constants. Thus, we provide the first regret bound for a fully practical model-free method whose regret scales only with the number of features, and not with the size of the underlying MDP. Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
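To make the policy-update rule concrete, the following is a minimal sketch (not the authors' implementation) of the update described in the abstract: in each phase the agent plays a Boltzmann (softmax) policy over the sum of the action-value estimates of all previous policies. The helpers `phi` (state-action features), `estimate_q_weights` (e.g., an LSPE-style estimator), and the temperature `eta` are hypothetical placeholders introduced for illustration.

```python
import numpy as np

def boltzmann_policy(q_sum, eta):
    """Action probabilities proportional to exp(eta * cumulative Q-estimate)."""
    logits = eta * q_sum
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def politex_sketch(num_phases, num_actions, d, eta, phi, estimate_q_weights):
    """Illustrative POLITEX loop with linear action-value estimates.

    phi(state, action)        -> d-dimensional feature vector (assumed helper)
    estimate_q_weights(policy)-> weight vector of a linear Q-estimate for that
                                 policy, e.g. obtained by running it for tau
                                 steps and fitting with LSPE (assumed helper)
    """
    w_sum = np.zeros(d)             # running sum of Q-estimate weights
    for k in range(num_phases):
        # Phase-k policy: Boltzmann over the summed estimates Q(s, a) = phi(s, a) @ w_sum.
        def policy(state, w_sum=w_sum.copy()):
            q_sum = np.array([phi(state, a) @ w_sum for a in range(num_actions)])
            return boltzmann_policy(q_sum, eta)
        # Estimate the current policy's action-value function and accumulate it;
        # with linear estimates, summing Q-functions reduces to summing weights.
        w_sum += estimate_q_weights(policy)
    return w_sum
```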