Safety by Design: High-Probability Constrained Contextual Bandits

Published: 29 Sept 2025, Last Modified: 12 Oct 2025 · NeurIPS 2025 Reliable ML Workshop · CC BY 4.0
Keywords: Bandit Algorithms, Contextual Bandits, Linear Bandits, Constrained Bandits
TL;DR: We propose an algorithm for the contextual bandit problem with both a reward and a cost signal, under stage-wise constraints on the realization of the cost signal.
Abstract: Multi-Armed Bandit algorithms have emerged as a fundamental framework for numerous recent applications, including reinforcement learning from human feedback (RLHF), optimal dosage determination, experimental design, advertising, recommendation systems, and fairness. Safety constraints are commonly incorporated to address real-world requirements such as preventing private information leakage in large language models, avoiding overdosing scenarios, and protecting vulnerable societal or client groups under optimistically deployed policies. One approach to modeling constrained optimization problems involves introducing two parametric unknown signals: a reward signal and a cost signal. The objective remains maximizing the expected reward while the stage-wise constrained formulation requires a specified statistic of the cost signal to remain within a predefined safety interval. Previous research has developed algorithms ensuring that the expected value of the cost signal remains below a desired threshold, with constraints satisfied with high probability. In this work, we extend these concepts to control the actual realization of the cost signal, ensuring it lies within the safety region with high probability. This advancement opens new directions for applications where hard safety constraints must be satisfied not merely in expectation but with near-certainty. We present an algorithm with accompanying regret bounds, initially for linear reward and cost signals, then generalize to broader function classes by parameterizing our results using the eluder dimension.
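The high-level recipe the abstract describes — optimism on the reward, pessimism on the cost, and a noise allowance so that the *realized* cost (not just its mean) stays in the safety region — can be illustrated with a minimal sketch. This is not the paper's algorithm; all parameters (the ridge regularizer `lam`, confidence width `beta`, noise allowance `3 * sigma`, threshold `tau`) and the shared-design simplification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true linear reward/cost parameters, unknown to the learner.
d, n_arms, T = 3, 5, 200
theta_r = rng.normal(size=d)   # reward parameter (assumed)
theta_c = rng.normal(size=d)   # cost parameter (assumed)
tau = 1.0                      # safety threshold on the cost *realization*
sigma = 0.1                    # noise scale for both signals
lam, beta = 1.0, 2.0           # ridge regularizer and confidence width (assumed)

V = lam * np.eye(d)            # shared regularized design matrix (simplification)
b_r, b_c = np.zeros(d), np.zeros(d)
violations = 0

for t in range(T):
    X = rng.normal(size=(n_arms, d)) / np.sqrt(d)  # per-arm contexts
    V_inv = np.linalg.inv(V)
    hat_r, hat_c = V_inv @ b_r, V_inv @ b_c
    width = beta * np.sqrt(np.einsum("ad,de,ae->a", X, V_inv, X))
    ucb_reward = X @ hat_r + width             # optimistic reward estimate
    # Pessimistic cost bound plus a noise allowance, so the realization
    # (mean + noise) stays below tau with high probability.
    ucb_cost = X @ hat_c + width + 3 * sigma
    feasible = ucb_cost <= tau
    if feasible.any():
        cand = np.flatnonzero(feasible)
        a = int(cand[np.argmax(ucb_reward[cand])])
    else:
        a = int(np.argmin(ucb_cost))           # fall back to the safest-looking arm
    x = X[a]
    r = x @ theta_r + sigma * rng.normal()     # realized reward
    c = x @ theta_c + sigma * rng.normal()     # realized cost
    violations += int(c > tau)
    V += np.outer(x, x)
    b_r += r * x
    b_c += c * x

violation_rate = violations / T
print(violation_rate)
```

In this toy run the constraint is enforced on the sampled cost itself, which is the distinction the paper draws from prior work that only bounds the expected cost; the `3 * sigma` margin is a stand-in for the high-probability noise bound a real analysis would derive.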
Submission Number: 156