Revisiting Familiar Places in an Infinite World: Continuing RL in Unbounded State Spaces

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: reinforcement learning, continuing RL, unbounded state space, reset-free RL
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Deep RL algorithms are fragile in continuing settings with unbounded state spaces. To mitigate this, we propose an algorithm that makes RL agents first pursue stability and then optimality.
Abstract: Deep reinforcement learning (RL) algorithms have been successfully applied to train neural network control policies for many sequential decision-making tasks. However, prior work has shown that neural networks are poor extrapolators and that deep RL algorithms perform poorly with weakly informative cost signals. In this paper, we show that these challenges are particularly problematic in real-world settings in which the state space is unbounded and learning must be done without regular episodic resets. For instance, in stochastic queueing problems, the state space and cost can be unbounded, and the agent may have to learn online without the system ever being reset to states the agent has seen before. In such settings, we show that deep RL agents can diverge into unseen states from which they can never recover, especially in highly stochastic environments. Towards overcoming this divergence, we introduce a Lyapunov-inspired reward shaping approach that encourages the agent to first learn to be stable (i.e., to achieve bounded cost) and then to learn to be optimal. We theoretically show that our reward shaping technique reduces the agent's rate of divergence and empirically find that it prevents divergence altogether. We further combine our reward shaping approach with a weight annealing scheme that gradually introduces the pursuit of optimality, and with a log-transform of state inputs, and find that these techniques enable deep RL algorithms to learn performant policies when learning online in domains with unbounded state spaces.
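
The abstract's recipe (a stability-first shaping term, an annealed weight on the original objective, and log-transformed state inputs) can be sketched roughly as below. This is a minimal illustrative sketch only: the L1-norm Lyapunov function, the linear annealing schedule, and every name here are assumptions for illustration, not the authors' exact formulation.

import numpy as np

def log_transform(state):
    # Sign-preserving log compression of unbounded state inputs
    # (e.g., queue lengths), so the policy network never sees
    # magnitudes far outside what it was trained on.
    return np.sign(state) * np.log1p(np.abs(state))

def anneal_weight(step, total_steps):
    # Gradually shift emphasis from stability toward optimality
    # over the course of training (assumed linear schedule).
    return min(1.0, step / total_steps)

def shaped_reward(cost, state, next_state, w,
                  lyapunov=lambda s: np.abs(s).sum()):
    # Lyapunov-inspired shaping: reward negative drift of a Lyapunov
    # function (here, total queue length -- an assumed choice), mixed
    # with the true negated cost via the annealed weight w.
    drift = lyapunov(next_state) - lyapunov(state)
    return (1.0 - w) * (-drift) + w * (-cost)

Under this sketch, an agent trained against shaped_reward with w annealed from 0 to 1 is first rewarded for keeping the Lyapunov drift negative (stability, i.e., bounded cost) and only later for minimizing the original cost (optimality).
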
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6591