Stochastic Linear Bandits with Unknown Safety Constraints and Local Feedback

Published: 19 Jun 2023, Last Modified: 09 Jul 2023
Venue: Frontiers4LCD
Abstract: In many real-world decision-making tasks, e.g., clinical trials, the agent must satisfy a diverse set of unknown safety constraints at all times while getting feedback only on the safety constraints relevant to the chosen action, e.g., the ones close to violation. In this work, we study stochastic linear bandits with such unknown safety constraints and local safety feedback. The agent's goal is to maximize the cumulative reward while satisfying \textit{multiple unknown affine or nonlinear} safety constraints. At each time step, the agent receives noisy feedback on a particular safety constraint \textit{only if} the chosen action belongs to the associated constraint set, i.e., local safety feedback. For this setting, we design upper confidence bound (UCB) and Thompson Sampling-based algorithms. In the design of these algorithms, we carefully prescribe an additional exploration incentive that guarantees the selection of high-reward actions that are also safe and ensures sufficient exploration in the relevant constraint sets to recover the optimal safe action. We show that for $M$ distinct constraints, both of these algorithms attain $\tilde{\mathcal{O}}(\sqrt{MT})$ regret after $T$ time steps without any safety violations. We empirically study the performance of the proposed algorithms under various safety constraints and with a real-world credit dataset. We show that both algorithms safely explore and quickly recover the optimal safe actions.
Keywords: Stochastic Linear Bandits, Safety Constraints, Nonlinear Constraints, Exploration
Submission Number: 47
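
The abstract names the algorithmic ingredients (confidence-set estimation per constraint, a safety check driven by local feedback, and an extra exploration incentive) without giving pseudocode. Below is a minimal Python sketch of a UCB-style learner in this setting, under standard assumptions: the class name `SafeLinUCB`, the ridge-regression updates, the confidence radius `beta`, and the affine-constraint safety check are illustrative choices, not the paper's actual algorithm. The sketch also assumes at least one action can be certified safe from the start (e.g., a known safe baseline).

```python
import numpy as np

class SafeLinUCB:
    """Illustrative sketch: linear bandit with M affine constraints and
    local feedback. Reward is <theta*, x>; constraint m requires
    <gamma_m*, x> <= c_m for actions inside its constraint set, and a
    noisy observation of <gamma_m*, x> arrives only for those actions."""

    def __init__(self, d, M, lam=1.0, beta=1.0):
        self.d, self.M, self.beta = d, M, beta
        # Ridge-regression statistics for the reward parameter theta*.
        self.V = lam * np.eye(d)
        self.b = np.zeros(d)
        # One set of statistics per constraint parameter gamma_m*.
        self.Vc = [lam * np.eye(d) for _ in range(M)]
        self.bc = [np.zeros(d) for _ in range(M)]

    def _bound(self, x, V, est):
        # Mean prediction and a generic UCB-style confidence width.
        width = self.beta * np.sqrt(x @ np.linalg.solve(V, x))
        return x @ est, width

    def select(self, actions, in_region, thresholds):
        """actions: (K, d) array; in_region[k][m]: does action k lie in
        constraint set m; thresholds[m]: safety level c_m. Returns the
        optimistic certified-safe action, or None if none is certified."""
        theta_hat = np.linalg.solve(self.V, self.b)
        best, best_val = None, -np.inf
        for k, x in enumerate(actions):
            safe = True
            for m in range(self.M):
                if not in_region[k][m]:
                    continue
                g_hat = np.linalg.solve(self.Vc[m], self.bc[m])
                mean, width = self._bound(x, self.Vc[m], g_hat)
                # Pessimistic check: the upper bound on the constraint
                # value must stay below the threshold to certify safety.
                if mean + width > thresholds[m]:
                    safe = False
                    break
            if not safe:
                continue
            mean, width = self._bound(x, self.V, theta_hat)
            # Optimism in reward; the width doubles as an exploration
            # bonus favoring directions that are still uncertain.
            if mean + width > best_val:
                best, best_val = k, mean + width
        return best

    def update(self, x, reward, local_feedback):
        """local_feedback: dict m -> noisy constraint observation,
        available only for constraints whose set contains x."""
        self.V += np.outer(x, x)
        self.b += reward * x
        for m, y in local_feedback.items():
            self.Vc[m] += np.outer(x, x)
            self.bc[m] += y * x
```

Note that the per-constraint statistics `Vc[m]` grow only when the chosen action falls inside constraint set $m$, which is exactly why local feedback requires the extra exploration incentive the abstract describes: without it, the learner may never reduce uncertainty inside the constraint sets containing the optimal safe action.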