**Abstract:**In light of the Bellman duality, we propose a novel value-policy gradient algorithm to explore and act in infinite-horizon Average-reward Markov Decision Process (AMDP) and show that it has sublinear regret. The algorithm is motivated by the Bellman saddle point formulation. It learns the optimal state-action distribution, which encodes a randomized policy, by interacting with the environment along a single trajectory and making primal-dual updates. The key to the analysis is to establish a connection between the min-max duality gap of Bellman saddle point and the cumulative regret of the learning agent. We show that, for ergodic AMDPs with finite state space $\mathcal{S}$ and action space $\mathcal{A}$ and uniformly bounded mixing times, the algorithm's $T$-time step regret is $$ R(T)=\tilde{\mathcal{O}}\left( \left(t_{mix}^*\right)^2 \tau^{\frac{3}{2}} \sqrt{(\tau^3 + |\mathcal{A}|) |\mathcal{S}| T} \right), $$ where $t_{mix}^*$ is the worst-case mixing time, $\tau$ is an ergodicity parameter, $T$ is the number of time steps and $\tilde{\mathcal{O}}$ hides polylog factors.

0 Replies

Loading