Factored DRO: Factored Distributionally Robust Policies for Contextual Bandits

Published: 31 Oct 2022, Last Modified: 15 Jan 2023
Venue: NeurIPS 2022 (Accept)
Readers: Everyone
Keywords: Contextual Bandits, Distributionally Robust Optimization, Distribution Shifts
Abstract: While there has been extensive work on learning from offline data in contextual multi-armed bandit settings, existing methods typically assume no environment shift: that the learned policy will operate in the same environment as the one that generated the data. This assumption limits the use of these methods in the many practical situations where distribution shifts do occur. In this work we propose Factored Distributionally Robust Optimization (Factored-DRO), which separately handles shifts in the context distribution and shifts in the reward-generating process. Prior work either ignores potential shifts in the context distribution or treats context and reward shifts jointly, which can lead to overly conservative policies, especially under certain forms of reward feedback. The Factored-DRO objective mitigates this by considering the two kinds of shift separately, and our proposed estimators are consistent and converge asymptotically. We also introduce a practical algorithm and demonstrate promising empirical results in environments based on real-world datasets, such as voting outcomes and scene classification.
TL;DR: Our algorithm Factored-DRO learns distributionally robust batch contextual bandit policies and can separately handle distribution shifts in the context distribution and shifts in the reward-generating process.
Supplementary Material: pdf
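
The abstract does not reproduce the objective itself, so the following is only a hedged illustration of the factored idea: it nests two worst-case (KL-DRO) evaluations of a fixed policy's value, an inner one over the per-context reward distribution and an outer one over the context distribution, so each radius can be set independently. The KL ambiguity sets, the dual-form solver, and all names (`kl_dro_value`, `factored_dro_value`, radii `rho_x`, `rho_r`) are illustrative assumptions, not the paper's actual estimators.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_dro_value(values, rho):
    """Worst-case mean of `values` over all distributions within a KL ball
    of radius `rho` around the empirical distribution, computed via the
    standard dual:  sup_{alpha>0} -alpha*log E[exp(-values/alpha)] - alpha*rho.
    """
    values = np.asarray(values, dtype=float)
    if rho == 0.0:
        return values.mean()  # no shift allowed: plain empirical mean

    def neg_dual(alpha):
        z = -values / alpha
        m = z.max()  # log-sum-exp shift for numerical stability
        log_mean_exp = m + np.log(np.exp(z - m).mean())
        return alpha * log_mean_exp + alpha * rho  # negated dual objective

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

def factored_dro_value(rewards_by_context, rho_x, rho_r):
    """Factored robust value of a fixed policy: an inner robust evaluation
    over the reward distribution within each context group, then an outer
    robust evaluation over the context distribution."""
    inner = np.array([kl_dro_value(r, rho_r) for r in rewards_by_context])
    return kl_dro_value(inner, rho_x)

# Toy usage (hypothetical data): rewards observed under the evaluated
# policy, grouped by context.
rng = np.random.default_rng(0)
rewards_by_context = [rng.uniform(0, 1, size=50) for _ in range(10)]
print(factored_dro_value(rewards_by_context, rho_x=0.1, rho_r=0.1))
```

Because the two radii are separate, the context shift can be budgeted independently of the reward shift, which is the mechanism the abstract credits for avoiding overly conservative policies.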