Online Adversarial MDPs with Off-Policy Feedback and Known Transitions

Published: 20 Jul 2023, Last Modified: 30 Aug 2023, EWRL16
Keywords: Adversarial Online Learning, Off-Policy
Abstract: In this paper, we address the challenge of online learning in adversarial Markov decision processes with known transitions and off-policy feedback. In this setting, the learner chooses a policy, but, unlike in the traditional on-policy setting, the environment is explored by means of a different, fixed, and possibly unknown policy (named the colleague's policy), whose losses are revealed to the learner. The off-policy feedback introduces a technical issue that is not present in traditional exploration-exploitation trade-off problems: the learner is charged with the regret of its chosen policy (w.r.t. a comparator policy), yet it observes only the losses suffered by the colleague's policy. To address this issue, we propose novel algorithms that, by employing pessimistic estimators---commonly adopted in the offline reinforcement learning literature---ensure sublinear regret bounds depending on the more desirable dissimilarity between any comparator policy and the colleague's policy, even when the latter is unknown.
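The abstract names pessimistic estimators as the key technical tool, but the paper's actual construction is not shown on this page. Below is a minimal, purely illustrative sketch in Python of one plausible count-based pessimistic loss estimator: state-action pairs rarely (or never) visited by the colleague's policy are charged close to the maximum loss, biasing the learner toward regions the colleague covers well. The function name and all parameters (pessimistic_loss_estimate, beta, loss_max) are assumptions, not the authors' method.

    import numpy as np

    def pessimistic_loss_estimate(loss_sums, counts, loss_max=1.0, beta=1.0):
        # Hypothetical sketch, not the paper's estimator.
        # loss_sums: cumulative observed losses per (state, action), shape (S, A)
        # counts:    visit counts per (state, action) under the colleague's policy
        # Losses are assumed bounded in [0, loss_max].

        # Empirical mean loss where data exists; zero elsewhere.
        mean = np.divide(loss_sums, counts,
                         out=np.zeros_like(loss_sums, dtype=float),
                         where=counts > 0)
        # Uncertainty bonus that shrinks with the visit count; unvisited
        # pairs receive the maximal loss (full pessimism).
        bonus = np.where(counts > 0,
                         beta / np.sqrt(np.maximum(counts, 1)),
                         loss_max)
        # Pessimistic (over-)estimate, clipped to the loss range.
        return np.minimum(mean + bonus, loss_max)

    # Example: 3 states, 2 actions; the colleague never tried action 1 in state 2,
    # so that pair is charged the maximal loss.
    loss_sums = np.array([[1.2, 0.4], [0.9, 1.1], [0.3, 0.0]])
    counts = np.array([[4, 2], [3, 5], [1, 0]])
    print(pessimistic_loss_estimate(loss_sums, counts, beta=0.5))

Such an over-estimate of losses in poorly covered regions is consistent with the abstract's claim that the resulting regret scales with the dissimilarity between the comparator policy and the colleague's policy: policies far from the colleague incur large pessimistic penalties.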