Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: offline policy learning, bandits, federated learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: We consider the problem of learning personalized decision policies from observational bandit feedback data across multiple heterogeneous data sources. In particular, we examine the practical considerations of this problem in the federated setting, where a central server aims to train a policy on data distributed across the heterogeneous sources, or clients, without collecting any of their raw data. We present a policy learning algorithm amenable to federation, based on aggregating local policies trained with doubly robust offline policy evaluation and learning strategies. We provide a novel regret analysis for our approach that establishes a finite-sample upper bound on a notion of global regret against a mixture distribution of clients. In addition, for any individual client, we establish a corresponding local regret upper bound characterized by measures of relative distribution shift to all other clients. Our analysis and supporting experimental results provide insights into tradeoffs in the participation of heterogeneous data sources in policy learning.
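To make the abstract's two ingredients concrete, here is a minimal sketch of (i) a standard doubly robust off-policy value estimate from logged bandit feedback on a single client and (ii) a simple sample-size-weighted aggregation of local estimates across clients. This is a generic illustration, not the paper's algorithm: the function names (`dr_value`, `select_global_policy`), the dictionary-based client data layout, and the particular aggregation rule are all assumptions for the example.

```python
# Hypothetical sketch of doubly robust (DR) off-policy evaluation per client,
# with a simple sample-size-weighted aggregation across clients.
# Assumed names and data layout; the paper's exact algorithm is not reproduced here.
import numpy as np


def dr_value(policy_actions, logged_actions, rewards, propensities, mu_hat):
    """Doubly robust estimate of a candidate policy's value from logged bandit data.

    policy_actions : (n,) actions the candidate policy would take on each context
    logged_actions : (n,) actions actually taken by the logging policy
    rewards        : (n,) observed rewards
    propensities   : (n,) logging probabilities of the logged actions
    mu_hat         : (n, K) estimated mean reward for each context-action pair
    """
    n = len(rewards)
    idx = np.arange(n)
    direct = mu_hat[idx, policy_actions]                       # direct-method term
    match = (policy_actions == logged_actions).astype(float)   # importance-weight indicator
    correction = match / propensities * (rewards - mu_hat[idx, logged_actions])
    return float(np.mean(direct + correction))


def select_global_policy(clients, candidate_policies):
    """Pick the candidate maximizing a sample-size-weighted average of local DR values.

    clients            : list of dicts with keys "contexts", "actions", "rewards",
                         "propensities", "mu_hat"
    candidate_policies : list of callables mapping a context array to an action array
    """
    weights = np.array([len(c["rewards"]) for c in clients], dtype=float)
    weights /= weights.sum()
    scores = []
    for policy in candidate_policies:
        local_vals = [
            dr_value(policy(c["contexts"]), c["actions"], c["rewards"],
                     c["propensities"], c["mu_hat"])
            for c in clients
        ]
        scores.append(float(np.dot(weights, local_vals)))
    best = int(np.argmax(scores))
    return candidate_policies[best], scores
```

In this toy version each client only contributes scalar value estimates for shared candidate policies, so no raw contexts, actions, or rewards need to leave a client, which is the federated constraint the abstract describes; how local policies or estimates are actually combined, and how heterogeneity across clients is handled, is the subject of the paper itself.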
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7156