Sleeping Reinforcement Learning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: In the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent's control. To address these realistic scenarios, we introduce a novel paradigm called *Sleeping Reinforcement Learning*, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, where $H$ is the horizon, $S$ and $A$ are the cardinalities of the state and action spaces, respectively, and $T$ is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order $\Omega(\sqrt{T 2^{A/2}})$ when the availability of actions is governed by a Markovian process, establishing a statistical barrier for the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, showing that the problem shares the same complexity as standard RL.
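To make the "sleeping" mechanism concrete, below is a minimal sketch of an optimistic, UCBVI-style backward induction in which the maximization is restricted to the currently available actions. This is not the paper's exact algorithm: the bonus shape, the availability mask `avail`, and all names are illustrative assumptions; see the linked repository for the authors' implementation.

```python
# Illustrative sketch only: Hoeffding-style bonuses and the availability
# mask avail[h, s, a] are assumptions, not the paper's exact construction.
import numpy as np

def optimistic_planning(P_hat, R_hat, counts, avail, H, delta=0.05):
    """Backward induction on the empirical model, maximizing only over
    actions available at each (stage, state) pair.

    P_hat:  (S, A, S) estimated transition probabilities
    R_hat:  (S, A)    estimated mean rewards in [0, 1]
    counts: (S, A)    visitation counts used for exploration bonuses
    avail:  (H, S, A) boolean mask; assume >= 1 available action per (h, s)
    """
    S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))              # V[H] = 0 (terminal stage)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Hoeffding-style exploration bonus (illustrative choice)
        bonus = H * np.sqrt(np.log(S * A * H / delta) / np.maximum(counts, 1))
        Q = R_hat + bonus + P_hat @ V[h + 1]        # (S, A) optimistic Q-values
        Q = np.clip(Q, 0.0, H)                      # keep values in [0, H]
        Q_masked = np.where(avail[h], Q, -np.inf)   # "sleep" unavailable actions
        pi[h] = np.argmax(Q_masked, axis=1)
        V[h] = np.max(Q_masked, axis=1)
    return pi, V
```

The only structural change with respect to standard optimistic value iteration is the masking step: unavailable actions are excluded from the argmax at each stage, so the greedy policy is always feasible under the revealed availability.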
Lay Summary: In Reinforcement Learning, an agent learns which actions to perform, i.e., a behavior, in order to solve a sequential decision-making problem. The standard assumption is that, at each decision step, the agent selects an action from a fixed and immutable action space. However, in real-world applications, not all actions may be available at every decision stage, with their availability depending on the environment state, on domain-specific constraints, or on other (potentially stochastic) exogenous factors. To address these scenarios, we propose the Sleeping Reinforcement Learning paradigm, extending the standard episodic tabular Reinforcement Learning setting with an action availability model. We study two scenarios, namely action availability revealed for the entire episode and availability revealed for a single stage at a time, and two action availability models, namely independent and Markovian. Using the *regret* (i.e., how much is lost w.r.t. always making optimal decisions) as a performance index, we study the *lower bound*, i.e., the theoretical limit, of the regret and propose algorithms based on the state of the art for standard RL that match such lower bounds up to logarithmic terms.
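For readers who prefer a formula, one plausible way to formalize the regret described above is the cumulative gap, over $K$ episodes, between the value of the best policy given the realized action availability and the value of the policy actually played; the paper's precise definition may differ in notation and in how the per-episode optimum is defined.

```latex
% Illustrative formalization (notation assumed, not taken from the paper):
% \mathcal{A}_k is the realized action availability in episode k,
% \pi^*_k the optimal policy under that availability, \pi_k the played policy.
\[
  \mathrm{Regret}(T) \;=\; \sum_{k=1}^{K}
  \Big( V^{\pi^*_k}_{1}\big(s_1^k;\,\mathcal{A}_k\big)
      \;-\; V^{\pi_k}_{1}\big(s_1^k;\,\mathcal{A}_k\big) \Big),
  \qquad T = K H .
\]
```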
Link To Code: https://github.com/marcomussi/SleepingRL
Primary Area: Reinforcement Learning->Online
Keywords: Reinforcement Learning, Sleeping, Regret Bounds, Lower Bounds
Submission Number: 5182