Exploring Non-Markov Environments Using Random Recurrent Memories

Published: 10 Jun 2026, Last Modified: 10 Jun 2026. RL in Big Worlds Poster. License: CC BY 4.0.
Keywords: Reinforcement learning, exploration, partial observability
TL;DR: To explore under partial observability, an agent must search over histories, not just observations. We can do so by featurizing histories with random recurrences and applying count-based exploration to the resulting memories.
Abstract: Exploring an environment requires a reinforcement learning agent to keep track of what it has already explored, which is difficult when observations do not fully reveal the environment state. In the Markov setting, exploration algorithms have focused on achieving systematic coverage of observed states. These same methods have frequently been applied to the non-Markov setting, where they instead aim for coverage over observations. However, decision-making in the non-Markov setting often depends on the agent's entire history rather than on single observations. Unfortunately, achieving systematic coverage over the space of all trajectories is untenable: exploration over histories is exponentially expensive, because the number of possible histories grows exponentially with the horizon. We therefore propose a new family of methods that featurize the agent's history with random recurrences. This produces finite-size random statistics of the agent's history, or *random recurrent memories*, and we aim for coverage over these memories. We describe desirable properties for efficient history compression with random recurrences and propose a new architecture type, the tangent recurrent unit (TRU). We show that on a diverse suite of partially observable exploration tasks, tangent recurrent units, as well as other structured random recurrences, outperform popular methods that aim for observation coverage.
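The general idea of counting over random recurrent memories instead of raw observations can be illustrated with a small sketch. This is a minimal sketch under stated assumptions, not the paper's method: it uses a plain, untrained tanh recurrence (an echo-state-style network) as a stand-in for the random recurrences described above, and it is NOT the tangent recurrent unit, whose equations are not given in this abstract. The class names `RandomRecurrentMemory` and `MemoryCountBonus`, the sign-pattern discretization of the memory, and the 1/sqrt(n) bonus are all hypothetical, illustrative choices.

```python
import math
import random
from collections import defaultdict


class RandomRecurrentMemory:
    """Hypothetical stand-in: a fixed random recurrence h_t = tanh(W h_{t-1} + U o_t).

    Assumption: a plain tanh recurrence, not the paper's tangent recurrent unit.
    The weights are sampled once and never trained.
    """

    def __init__(self, obs_dim, mem_dim, seed=0, scale=0.9):
        rng = random.Random(seed)
        # Entry std scale/sqrt(mem_dim) puts the spectral radius of W near
        # `scale`, so the influence of old inputs roughly fades over time.
        self.W = [[rng.gauss(0.0, scale / math.sqrt(mem_dim)) for _ in range(mem_dim)]
                  for _ in range(mem_dim)]
        self.U = [[rng.gauss(0.0, 1.0 / math.sqrt(obs_dim)) for _ in range(obs_dim)]
                  for _ in range(mem_dim)]
        self.mem_dim = mem_dim
        self.reset()

    def reset(self):
        # Start each episode from an all-zero memory.
        self.h = [0.0] * self.mem_dim

    def step(self, obs):
        # Fold the new observation into the fixed-size memory vector.
        self.h = [
            math.tanh(sum(w * x for w, x in zip(w_row, self.h))
                      + sum(u * o for u, o in zip(u_row, obs)))
            for w_row, u_row in zip(self.W, self.U)
        ]
        return self.h


class MemoryCountBonus:
    """Hypothetical count-based bonus over discretized memories."""

    def __init__(self):
        self.counts = defaultdict(int)

    def bonus(self, memory):
        # Assumption: discretize the memory by its sign pattern.
        key = tuple(x > 0.0 for x in memory)
        self.counts[key] += 1
        # Rarely visited memory cells receive a larger exploration bonus.
        return 1.0 / math.sqrt(self.counts[key])


memory = RandomRecurrentMemory(obs_dim=2, mem_dim=8, seed=0)
counter = MemoryCountBonus()
for obs in ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]):
    print(round(counter.bonus(memory.step(obs)), 3))
```

Because the random weights are fixed, the same history always maps to the same memory, so repeated histories accumulate counts while two histories that end in the same observation can still fall into different memory cells.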
Supplementary Material: pdf
Submission Number: 17