Online Regret Bounds for Satisficing in MDPs

Published: 20 Jul 2023, Last Modified: 29 Aug 2023 (EWRL16)
Keywords: regret, satisficing
TL;DR: We derive the first online regret bounds for satisficing in MDPs
Abstract: We consider general reinforcement learning under the average reward criterion in Markov decision processes (MDPs), where the learner's goal is not to learn an optimal policy but to find any policy whose average reward exceeds a given satisfaction level $\sigma$. We show that with this more modest objective it is possible to have algorithms that incur only constant regret with respect to the level $\sigma$, provided that there is a policy above this level. This result generalizes findings of Bubeck et al. (COLT 2013) from the bandit setting to MDPs. Further, we present a more general algorithm that achieves the best of both worlds: if the optimal policy has average reward above $\sigma$, this algorithm has bounded regret with respect to $\sigma$. On the other hand, if all policies are below $\sigma$, we show logarithmic bounds on the expected regret with respect to the optimal policy.
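For concreteness, one natural way to formalize the satisficing regret discussed in the abstract is the following sketch (this exact definition is an assumption; the paper's formal definition may differ in details):

$$R_\sigma(T) \;=\; T\sigma \;-\; \sum_{t=1}^{T} r_t,$$

where $r_t$ is the reward collected at step $t$. Under this reading, "constant regret with respect to the level $\sigma$" means that $\mathbb{E}[R_\sigma(T)]$ remains bounded by a constant independent of the horizon $T$ (provided some policy has average reward at least $\sigma$), in contrast to classical regret, which compares the accumulated reward to $T\rho^*$ for the optimal average reward $\rho^*$.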