Online Learning with Unknown Constraints

Karthik Sridharan; Seung Won Wilson Yoo

Published: 01 May 2025, Last Modified: 18 Jun 2025. Venue: ICML 2025 poster. License: CC BY 4.0

Abstract: We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm's regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that characterizes the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with $\sqrt{T}$ regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction.
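To make the idea of a scaling transformation for linear constraints concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: the constraint is $\langle \theta^*, a \rangle \le b$ with $b > 0$ (so the origin is a known safe action), the unknown $\theta^*$ lies in an $\ell_2$ ball of radius `beta` around an estimate `theta_hat`, and an optimistic action has already been chosen by some other routine. The function names and the confidence-set shape are hypothetical illustrations, not the authors' algorithm. The sketch shrinks the optimistic action toward the origin until the worst-case constraint value over the confidence ball is within budget.

```python
import math


def pessimistic_value(theta_hat, beta, action):
    """Worst-case value of <theta, action> over the l2 ball of radius beta
    around theta_hat (assumed confidence set): <theta_hat, a> + beta * ||a||."""
    dot = sum(t * a for t, a in zip(theta_hat, action))
    norm = math.sqrt(sum(a * a for a in action))
    return dot + beta * norm


def scale_to_safe(theta_hat, beta, budget, action):
    """Return (gamma, gamma * action), where gamma in (0, 1] is the largest
    scaling for which the pessimistic constraint value is at most budget.

    Assumes budget > 0, so the origin is strictly safe. The pessimistic
    value is positively homogeneous in the action, so gamma has a closed form.
    """
    if budget <= 0:
        raise ValueError("this sketch assumes a strictly positive budget")
    worst = pessimistic_value(theta_hat, beta, action)
    if worst <= budget:
        # The optimistic action is already safe for every theta in the ball.
        return 1.0, list(action)
    gamma = budget / worst
    return gamma, [gamma * a for a in action]


# Hypothetical numbers for illustration only.
gamma, safe_action = scale_to_safe([1.0, 0.0], 0.5, 1.0, [2.0, 0.0])
```

As the estimate improves and `beta` shrinks, the scaling factor moves toward 1, so less of the optimistic action is sacrificed for safety.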

Lay Summary: How can we design machine learning algorithms that are always safe, even when we don't fully know what 'safe' means at the start? Compared to previous works that study being safe on average, our work tackles the more challenging setting of always being safe. This is useful in settings where you have to be safe, such as preventing a robot from crashing or following laws and regulations. We built algorithms that perform well while also discovering what safe means despite noisy feedback. Our algorithm takes optimistic actions that are guaranteed to perform well and converts them into pessimistic actions that are guaranteed to be safe. We also came up with a rule, technically called a complexity measure, that captures the trade-off between performance and being safe. Finally, we give some examples of our algorithm in action for useful settings.

Primary Area: Theory->Online Learning and Bandits

Keywords: Online Learning, Safe Learning

Submission Number: 4858
