Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games
TL;DR: We propose a meta-algorithm for online model selection in average reward RL, with theoretical guarantees, an application to repeated games, and empirical demonstrations.
Abstract: In standard RL, the structure of the Markov Decision Process (e.g., the state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of $M > 1$ model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose $\textsf{MRBEAR}$, an online model selection algorithm for the average reward RL setting, based on the idea of regret balancing and elimination. The regret of the algorithm is $\tilde O(M C_{m^*}^2 B_{m^*}(T,\delta))$, where $C_{m^*}$ represents the complexity of the simplest well-specified model class and $B_{m^*}(T,\delta)$ is its corresponding regret bound. This result shows that in average reward RL, the additional cost of model selection scales only linearly in $M$, the number of model classes.
As an application, in a simultaneous general-sum repeated game where the opponent follows a fixed, unknown, limited-memory strategy, the learner can maximize its utility using $\textsf{MRBEAR}$. By proving a lower bound, we show that the learner's regret is tight in the opponent's memory order. In addition, we demonstrate the algorithm's performance through experiments.
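To illustrate the regret-balancing-and-elimination idea underlying the algorithm, the sketch below runs $M$ base learners, repeatedly advancing the one whose candidate regret bound $B_m(t,\delta)$ is currently smallest, and eliminating any learner whose optimistic reward estimate falls below the best pessimistic one. This is a generic, simplified sketch of the regret-balancing principle, not the paper's $\textsf{MRBEAR}$ pseudocode; the `ConstantLearner` class, the function names, and the confidence intervals are illustrative assumptions.

```python
import math

class ConstantLearner:
    """Toy base learner that always receives a fixed reward in [0, 1]
    (a stand-in for a base RL algorithm interacting with the MDP)."""
    def __init__(self, mean):
        self.mean = mean
    def act(self):
        return self.mean

def regret_balance_eliminate(learners, bounds, T, delta=0.05):
    """Illustrative regret-balancing-and-elimination loop.

    learners: M base learners, each exposing act() -> reward in [0, 1]
    bounds:   M candidate regret-bound functions B_m(t, delta)
    Returns the surviving learner indices and the per-learner play counts.
    """
    M = len(learners)
    active = list(range(M))
    reward_sum = [0.0] * M  # cumulative reward collected by learner m
    plays = [0] * M         # rounds allotted to learner m

    def lcb(i):  # pessimistic estimate of learner i's average reward
        if plays[i] == 0:
            return 0.0
        return reward_sum[i] / plays[i] - bounds[i](plays[i], delta) / plays[i]

    def ucb(i):  # optimistic estimate of learner i's average reward
        if plays[i] == 0:
            return 1.0
        return reward_sum[i] / plays[i] + bounds[i](plays[i], delta) / plays[i]

    for _ in range(T):
        # Regret balancing: advance the active learner whose candidate
        # regret bound is currently smallest, keeping putative regrets even.
        m = min(active, key=lambda i: bounds[i](plays[i] + 1, delta))
        reward_sum[m] += learners[m].act()
        plays[m] += 1
        # Elimination: a learner whose optimistic estimate falls below the
        # best pessimistic estimate cannot be well-specified; drop it.
        best = max(lcb(i) for i in active)
        active = [i for i in active if ucb(i) >= best]
    return active, plays
```

For example, with two learners of mean reward 0.9 and 0.1 and a common candidate bound $B_m(t,\delta)=\sqrt{t}$, the loop first alternates between the learners and then eliminates the weaker one once its confidence interval separates, after which all remaining rounds go to the survivor.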
Submission Number: 1877