Keywords: Reinforcement learning, fairness, regret minimization, multi-objective optimization, constrained Markov decision processes
Abstract: We consider reinforcement learning with vectorial rewards, where the agent receives a vector of $K\geq 2$ different types of rewards at each time step. The agent aims to maximize the minimum total reward among the $K$ reward types. Unlike existing works, which focus on maximizing the minimum expected total reward, i.e., \emph{ex-ante max-min fairness}, we maximize the expected minimum total reward, i.e., \emph{ex-post max-min fairness}. Through an example and numerical experiments, we show that the optimal policy for the former objective generally does not converge to optimality under the latter, even as the number of time steps $T$ grows. Our main contribution is a novel algorithm, Online-ReOpt, that achieves near-optimality under our objective, assuming access to an optimization oracle that returns a near-optimal policy for any scalar reward. The expected objective value under Online-ReOpt is shown to converge to the asymptotic optimum as $T$ increases. Finally, we propose offline variants that ease the online computational burden of Online-ReOpt, and we generalize the max-min objective to concave utility maximization.
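For concreteness, writing $R_k$ for the total type-$k$ reward accumulated over the $T$ steps under a policy $\pi$ (notation assumed here for illustration), the two objectives contrasted in the abstract can be written as
$$\text{ex-ante:}\quad \max_\pi \, \min_{k\in[K]} \, \mathbb{E}_\pi\!\left[R_k\right], \qquad\qquad \text{ex-post:}\quad \max_\pi \, \mathbb{E}_\pi\!\left[\min_{k\in[K]} R_k\right].$$
Since $\mathbb{E}_\pi[\min_{k} R_k] \leq \min_{k} \mathbb{E}_\pi[R_k]$ (the minimum is concave), the two objectives can differ, and a policy that is optimal ex-ante need not be near-optimal ex-post.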
One-sentence Summary: We develop a near-optimal algorithm for reinforcement learning with vectorial rewards, where we maximize the expected minimum total reward over $K>1$ reward types.
Supplementary Material: zip