Maximizing Benefits under Harm Constraints: A Generalized Linear Contextual Bandit Approach

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Contextual multi-armed bandit, online learning, generalized linear models, varying coefficient models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In many contextual sequential decision-making scenarios, such as dose-finding clinical trials for new drugs or personalized news article recommendation in social media, each action can simultaneously carry both benefits and potential harm. This can manifest as efficacy versus side effects in clinical trials, or increased user engagement versus the risk of radicalization and psychological distress in news recommendation. Such multifaceted situations can be modeled within the multi-armed bandit (MAB) framework. Given the intricate balance of positive and negative outcomes in these contexts, there is a compelling need for methods that maximize benefit while limiting harm within the MAB framework. This paper aims to fill that gap. Our primary contributions are twofold: (i) We propose a novel contextual MAB model whose objective is to maximize reward while satisfying harm constraints. In this model, both reward and harm are governed by generalized linear models whose coefficients vary with the contextual variables, a flexibility that makes the model applicable to a wide range of scenarios. (ii) Building on this generalized linear contextual MAB model, we develop an $\epsilon$-greedy-based policy that balances exploration and exploitation to achieve the desired trade-off between benefit and harm. We show that this policy achieves a sublinear $\mathcal{O}(\sqrt{T\log T})$ regret. Extensive experimental results support our theoretical analyses and validate the effectiveness of the proposed model and policy.
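To make the setup concrete, below is a minimal simulation sketch of an $\epsilon$-greedy policy under a harm constraint. It assumes Bernoulli reward and harm outcomes with a logistic link and fixed per-arm coefficients (a simplification of the paper's varying-coefficient formulation); the decaying exploration schedule, the harm threshold `tau`, and all variable names are illustrative assumptions, not the authors' exact algorithm.

```python
# Illustrative sketch (not the authors' code): epsilon-greedy contextual
# bandit that maximizes estimated reward subject to an estimated harm cap.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
K, d, T, tau = 5, 4, 2000, 0.3           # arms, context dim, horizon, harm cap

# Ground-truth GLM coefficients (for simulation only).
theta_r = rng.normal(size=(K, d))         # reward coefficients per arm
theta_h = rng.normal(size=(K, d))         # harm coefficients per arm
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Per-arm data buffers and fitted GLMs.
X = [[] for _ in range(K)]
R = [[] for _ in range(K)]
H = [[] for _ in range(K)]
reward_models = [None] * K
harm_models = [None] * K

for t in range(1, T + 1):
    x = rng.normal(size=d)                # observe context
    eps = min(1.0, 5.0 * K / t)           # decaying exploration rate (assumed)

    # Plug-in estimates of reward/harm probabilities under current fits.
    est_r = np.full(K, 0.5)
    est_h = np.full(K, 0.5)
    for a in range(K):
        if reward_models[a] is not None:
            est_r[a] = reward_models[a].predict_proba(x[None, :])[0, 1]
        if harm_models[a] is not None:
            est_h[a] = harm_models[a].predict_proba(x[None, :])[0, 1]

    feasible = np.flatnonzero(est_h <= tau)
    if rng.random() < eps or feasible.size == 0:
        a = rng.integers(K)               # explore (or no arm looks feasible)
    else:
        a = feasible[np.argmax(est_r[feasible])]  # exploit: best feasible arm

    # Observe Bernoulli reward and harm from the true GLMs.
    r = int(rng.random() < sigmoid(theta_r[a] @ x))
    h = int(rng.random() < sigmoid(theta_h[a] @ x))

    # Refit the chosen arm's GLMs once both outcome classes are observed.
    X[a].append(x); R[a].append(r); H[a].append(h)
    if len(set(R[a])) == 2:
        reward_models[a] = LogisticRegression().fit(np.array(X[a]), R[a])
    if len(set(H[a])) == 2:
        harm_models[a] = LogisticRegression().fit(np.array(X[a]), H[a])
```

Restricting exploitation to the estimated-feasible set is one simple way to encode the harm constraint; the paper's actual policy and its regret analysis may handle infeasibility and exploration scheduling differently.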
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2159