Finding the FrameStack: Learning What to Remember for Non-Markovian Reinforcement Learning

Published: 01 Jul 2025, Last Modified: 21 Jul 2025 · Finding the Frame (RLC 2025) · CC BY 4.0
Keywords: Reinforcement Learning, Non-Markovian Decision Processes, Partial Observability, Bounded Agents, Memory-Constrained Agents, Temporal Abstraction, Adaptive Memory
TL;DR: Adaptive Stacking enables RL agents to efficiently handle long-term dependencies by learning what to remember, reducing memory and compute costs while preserving optimality.
Abstract: Recent success in developing increasingly general-purpose agents based on sequence models has led to increased focus on deploying computationally limited agents in the vastly more complex real world. A key challenge in these more realistic domains is highly non-Markovian dependencies with respect to the agent's observations, which are less common in small, controlled domains. The predominant approach in the literature is to stack together a window of the most recent observations (Frame Stacking), but this window must grow with the degree of non-Markovian dependence, resulting in prohibitive computational and memory requirements for both action inference and learning. In this paper, we are motivated by the insight that many environments that are highly non-Markovian with respect to time causally depend on only a relatively small number of observations over that time-scale. A natural direction is therefore to consider meta-algorithms that maintain relatively small adaptive stacks of memories, so that highly non-Markovian dependencies with respect to time can be expressed while considering fewer observations at each step, yielding substantial savings in both compute and memory. Hence, we propose a meta-algorithm (Adaptive Stacking) that achieves exactly this with convergence guarantees, and we quantify the reduced computation and memory requirements for MLP-, LSTM-, and Transformer-based agents. Our experiments use the classic T-Maze domain, which gives us direct control over the degree of non-Markovian dependence in the environment. This allows us to demonstrate that an appropriate meta-algorithm can learn to remove memories that are not predictive of future rewards, and that the stack-management policy converges without excessive removal of important experiences.
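The sketch below is an illustrative toy example, not the authors' implementation, of the contrast the abstract draws: a fixed-window frame stack whose state grows with the dependency horizon versus a bounded adaptive stack that keeps only a few observations and evicts the one judged least relevant. The `relevance_fn` scorer is a hypothetical stand-in for the learned stack-management policy described in the paper.

```python
# Illustrative sketch (assumed interfaces, not the paper's code): fixed-window
# Frame Stacking vs. a bounded adaptive stack with a placeholder eviction rule.
from collections import deque
import numpy as np


class FrameStack:
    """Fixed window: state size grows with the dependency horizon."""

    def __init__(self, window_size, obs_dim):
        self.window_size = window_size
        self.obs_dim = obs_dim
        self.buffer = deque(maxlen=window_size)

    def push(self, obs):
        self.buffer.append(obs)

    def state(self):
        # Zero-pad until the window is full, then concatenate all frames.
        frames = list(self.buffer)
        pad = [np.zeros(self.obs_dim)] * (self.window_size - len(frames))
        return np.concatenate(pad + frames)


class AdaptiveStack:
    """Bounded stack: keeps at most `capacity` observations.

    `relevance_fn` is a hypothetical stand-in for a learned stack-management
    policy that scores how predictive a stored observation is of future reward.
    """

    def __init__(self, capacity, obs_dim, relevance_fn):
        self.capacity = capacity
        self.obs_dim = obs_dim
        self.relevance_fn = relevance_fn
        self.memories = []

    def push(self, obs):
        self.memories.append(obs)
        if len(self.memories) > self.capacity:
            # Evict the memory judged least predictive of future reward.
            scores = [self.relevance_fn(m) for m in self.memories]
            self.memories.pop(int(np.argmin(scores)))

    def state(self):
        pad = [np.zeros(self.obs_dim)] * (self.capacity - len(self.memories))
        return np.concatenate(self.memories + pad)


if __name__ == "__main__":
    obs_dim = 4
    rng = np.random.default_rng(0)

    # Toy relevance heuristic: treat observations with a large first feature
    # (e.g. a T-Maze "signal" frame at the start of the corridor) as worth keeping.
    relevance = lambda obs: abs(obs[0])

    frame_stack = FrameStack(window_size=50, obs_dim=obs_dim)  # grows with horizon
    adaptive = AdaptiveStack(capacity=4, obs_dim=obs_dim, relevance_fn=relevance)

    for t in range(50):
        obs = rng.normal(size=obs_dim) * (10.0 if t == 0 else 1.0)  # signal at t=0
        frame_stack.push(obs)
        adaptive.push(obs)

    print("frame-stack state dim:", frame_stack.state().shape[0])  # 200
    print("adaptive state dim:   ", adaptive.state().shape[0])     # 16
```

Under these assumptions, the downstream agent conditions on a 16-dimensional state instead of a 200-dimensional one while the early "signal" observation is retained, which is the compute and memory saving the abstract claims for Adaptive Stacking.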
Submission Number: 37