Abstract: We consider reinforcement learning (RL) in Markov decision processes, in which an agent repeatedly interacts with an environment modeled by a controlled Markov process. At each time step <inline-formula><tex-math notation="LaTeX">$t$</tex-math></inline-formula>, the agent earns a reward and also incurs a cost vector consisting of <inline-formula><tex-math notation="LaTeX">$M$</tex-math></inline-formula> costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula> time steps while simultaneously ensuring that the average values of the <inline-formula><tex-math notation="LaTeX">$M$</tex-math></inline-formula> cost expenditures are bounded by agent-specified thresholds <inline-formula><tex-math notation="LaTeX">$c^{\text{ub}}_{i},i=1,2,\ldots,M$</tex-math></inline-formula>. Considering cumulative cost expenditures departs from the existing literature in that the agent must now balance its cost expenses online while also managing the exploration–exploitation tradeoff typically encountered in RL tasks. This is challenging because both exploration and exploitation require the agent to expend resources. To measure the performance of an RL algorithm under the average cost constraints, we define an <inline-formula><tex-math notation="LaTeX">$(M+1)$</tex-math></inline-formula>-dimensional regret vector composed of its reward regret and <inline-formula><tex-math notation="LaTeX">$M$</tex-math></inline-formula> cost regrets.
The reward regret measures the suboptimality in the cumulative reward, while the <inline-formula><tex-math notation="LaTeX">$i$</tex-math></inline-formula>th component of the cost regret vector is the difference between the algorithm's <inline-formula><tex-math notation="LaTeX">$i$</tex-math></inline-formula>th cumulative cost expense and the permitted total expenditure <inline-formula><tex-math notation="LaTeX">$Tc^{\text{ub}}_{i}$</tex-math></inline-formula>. We prove that the expected value of the regret vector is upper-bounded as <inline-formula><tex-math notation="LaTeX">$\tilde{O}(T^{2\slash 3})$</tex-math></inline-formula>, where <inline-formula><tex-math notation="LaTeX">$\tilde{O}(\cdot)$</tex-math></inline-formula> hides factors that are logarithmic in <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula>. We further show how to reduce the regret of a desired subset of the <inline-formula><tex-math notation="LaTeX">$M$</tex-math></inline-formula> costs, at the expense of increasing the regrets of the reward and the remaining costs. To the best of our knowledge, ours is the only work that considers nonepisodic RL under average cost constraints and derives algorithms that can <i>tune the regret vector</i> according to the agent's requirements on its cost regrets.
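
As an illustration of the regret vector described above, the following minimal sketch computes it from one observed trajectory. It is not the paper's algorithm. It assumes the reward regret takes the common form of T times a benchmark average reward minus the cumulative reward earned; the abstract only says it measures suboptimality in the cumulative reward. The function name `regret_vector` and the parameters `r_star` (benchmark optimal average reward) and `c_ub` (thresholds) are hypothetical names chosen for this example.

```python
from typing import List, Sequence


def regret_vector(
    rewards: Sequence[float],
    costs: Sequence[Sequence[float]],
    r_star: float,
    c_ub: Sequence[float],
) -> List[float]:
    """Return [reward_regret, cost_regret_1, ..., cost_regret_M].

    rewards[t] is the reward earned at step t, and costs[t][i] is the
    i-th cost incurred at step t. r_star is an ASSUMED benchmark: the
    optimal average reward achievable under the cost constraints.
    """
    T = len(rewards)
    if len(costs) != T:
        raise ValueError("rewards and costs must cover the same T steps")

    # Assumed form of the reward regret: benchmark total minus earned total.
    reward_regret = T * r_star - sum(rewards)

    # i-th cost regret: cumulative i-th cost minus the budget T * c_ub[i].
    cost_regrets = [
        sum(step[i] for step in costs) - T * c_ub[i]
        for i in range(len(c_ub))
    ]
    return [reward_regret] + cost_regrets


if __name__ == "__main__":
    # Toy trajectory with T = 4 steps and M = 2 costs.
    rewards = [1.0, 0.0, 1.0, 1.0]
    costs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
    print(regret_vector(rewards, costs, r_star=1.0, c_ub=[0.5, 0.5]))
```

A positive cost regret means the corresponding budget was overspent over the horizon, and a negative one means it was underspent. The bound stated in the abstract concerns the expected value of this vector, not a single trajectory.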