Keywords: Fair RL, Vector-Valued MDP, PAC-MDP, KWIK Learning
Abstract: We propose a welfare-centric fair reinforcement-learning setting, in which an agent
enjoys vector-valued reward from a set of beneficiaries. Given a welfare function W(·),
the task is to select a policy π̂ that approximately optimizes the welfare of their value
functions from start state s0, i.e., π̂ ≈ argmaxπ W(Vπ1(s0), Vπ2(s0), …, Vπg(s0)). We
find that welfare-optimal policies are stochastic and start-state dependent. Whether
individual actions are mistakes depends on the policy; thus mistake bounds, regret
analysis, and PAC-MDP learning do not readily generalize to our setting. We develop
the adversarial-fair KWIK (Kwik-Af) learning model, wherein at each timestep,
an agent either takes an exploration action or outputs an exploitation policy, such
that the number of exploration actions is bounded and each exploitation policy
is ε-welfare optimal. Finally, we reduce PAC-MDP to Kwik-Af, introduce the
Equitable Explicit Explore Exploit (E4) learner, and show that it Kwik-Af learns.
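A minimal illustrative sketch of the welfare objective above, on a toy two-state MDP with two beneficiaries whose rewards conflict in the start state. The egalitarian welfare W = min, the transition and reward tables, and the grid search over mixing weights are assumptions made only for illustration (the paper leaves W general); the sketch shows why a stochastic, start-state-dependent policy can maximize welfare.

import numpy as np

n_states, n_actions, g, gamma = 2, 2, 2, 0.9

# P[a] is the transition matrix under action a; R[a][s, i] is beneficiary i's reward at state s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # action 0
              [[0.1, 0.9], [0.8, 0.2]]])       # action 1
R = np.array([[[1.0, 0.0], [0.0, 0.0]],        # action 0: favors beneficiary 0 at s0
              [[0.0, 1.0], [0.0, 0.0]]])       # action 1: favors beneficiary 1 at s0

def beneficiary_values(pi, s0=0):
    """Exact V^i_pi(s0) for each beneficiary i, for a stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,ast->st", pi, P)       # state-to-state transitions under pi
    R_pi = np.einsum("sa,asi->si", pi, R)       # expected per-state vector reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)  # (states x beneficiaries)
    return V[s0]

def welfare(values):
    return values.min()                          # egalitarian W; an illustrative choice only

# Grid-search stochastic policies that mix the two actions at the start state.
best = max(
    ((p, welfare(beneficiary_values(np.array([[p, 1 - p], [0.5, 0.5]]))))
     for p in np.linspace(0, 1, 101)),
    key=lambda t: t[1],
)
print(f"best mixing weight at s0: {best[0]:.2f}, welfare {best[1]:.3f}")

Under these toy dynamics the welfare-maximizing mixing weight is interior (near 0.5), consistent with the abstract's observation that welfare-optimal policies are stochastic and start-state dependent.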
Submission Number: 133