Avoiding Catastrophe in Online Learning by Asking for Help

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-ND 4.0
TL;DR: If a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
Abstract: Most learning algorithms with formal regret guarantees assume that all mistakes are recoverable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are *catastrophic*, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe in that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We also assume that the agent can transfer knowledge between similar inputs. We first show that in general, any algorithm either queries the mentor at a linear rate or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Although our focus is the product of payoffs, we provide matching bounds for the typical additive regret. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
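To make the abstract's objective concrete, here is one plausible formalization consistent with the description above; the notation ($p_t$, $x_t$, $a_t$, $\pi^m$) is assumed for this sketch and may differ from the paper's exact definitions:

```latex
% One plausible formalization (notation assumed, not necessarily the paper's).
% p_t(x, a) \in [0, 1] is the chance of avoiding catastrophe when taking
% action a on input x in round t, and \pi^m is the mentor's policy.
\[
  \underbrace{\prod_{t=1}^{T} p_t(x_t, a_t)}_{\text{overall chance of avoiding catastrophe}}
  \qquad\qquad
  R_T \;=\; \prod_{t=1}^{T} p_t\bigl(x_t, \pi^m(x_t)\bigr)
       \;-\; \prod_{t=1}^{T} p_t\bigl(x_t, a_t\bigr)
\]
% Under this reading, the positive result says that both the regret R_T and
% the fraction of rounds spent querying the mentor vanish as T grows,
% provided the mentor policy class is learnable in the standard online model.
```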
Lay Summary: Many machine learning algorithms rely on essentially trying all possible behaviors and seeing what works well. However, this is too dangerous when errors can be *catastrophic*, i.e., irreparable. For example, we do not want self-driving cars or surgical robots to try all possible behaviors. One could train an AI system entirely in a controlled lab setting where mistakes are less impactful, but we argue that sufficiently general AI systems will inevitably encounter unfamiliar situations when deployed in the real world. How should AI systems behave in such situations? We suggest that they should *ask for help* (e.g., from a human supervisor). In this paper, we study a mathematical model where the AI system can ask for help a limited number of times. We provide a simple algorithm that mostly acts like a typical learning algorithm, but asks for help in situations that are very different from its prior experiences. We prove that under some mild conditions, our algorithm is guaranteed to avoid catastrophe while gradually becoming self-sufficient. Overall, our work provides a theoretical basis for how AI systems can learn safely in high-stakes applications.
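The lay summary's "ask for help on unfamiliar inputs" idea can be sketched in a few lines of code. The toy Python agent below is purely illustrative: the nearest-neighbor imitation rule, the Euclidean distance, and the fixed `threshold` parameter are assumptions made for this sketch, not the paper's actual algorithm.

```python
import numpy as np

def should_ask_for_help(x, queried_inputs, threshold):
    """Query the mentor when x is far from every previously queried input.

    Illustrative heuristic only: distance metric and threshold are assumed.
    """
    if not queried_inputs:
        return True
    return min(np.linalg.norm(x - q) for q in queried_inputs) > threshold

class AskForHelpAgent:
    """Toy sketch: imitate the mentor via nearest neighbor, querying only on
    unfamiliar inputs so the query rate falls as experience accumulates."""

    def __init__(self, mentor, threshold=0.5):
        self.mentor = mentor          # callable: input -> action (costly query)
        self.threshold = threshold    # "unfamiliarity" radius (assumed)
        self.memory = []              # list of (input, mentor_action) pairs

    def act(self, x):
        if should_ask_for_help(x, [q for q, _ in self.memory], self.threshold):
            a = self.mentor(x)        # ask for help on an unfamiliar input
            self.memory.append((x, a))
            return a
        # Otherwise transfer knowledge from the most similar queried input.
        _, a = min(self.memory, key=lambda qa: np.linalg.norm(x - qa[0]))
        return a

# Toy usage: a hypothetical mentor that labels points by their first coordinate.
mentor = lambda x: int(x[0] > 0)
agent = AskForHelpAgent(mentor, threshold=0.5)
actions = [agent.act(np.random.randn(2)) for _ in range(100)]
```

Note the design choice this mirrors: queries are concentrated early and on novel regions of the input space, so the agent gradually becomes self-sufficient, which is the qualitative behavior the paper's guarantees describe.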
Primary Area: Theory->Online Learning and Bandits
Keywords: online learning, AI safety, asking for help, irreversibility
Submission Number: 941