Balancing Context Length and Mixing Times for Reinforcement Learning at Scale

Published: 25 Sept 2024, Last Modified: 06 Nov 2024 · NeurIPS 2024 poster · CC BY 4.0
Keywords: Mixing Times, Non-Markovian, Average Reward, Long Context, Partial Observability, Causal Structure
Abstract: Due to the recent remarkable advances in artificial intelligence, researchers have begun to consider challenging learning problems such as learning to generalize behavior from large offline datasets or learning online in non-Markovian environments. Meanwhile, recent advances in both of these areas have increasingly relied on conditioning policies on large context lengths. A natural question is whether there is a limit to the performance benefits of increasing the context length when the necessary computation is available. In this work, we establish a novel theoretical result that links the context length of a policy to the time needed to reliably evaluate its performance (i.e., its mixing time) in large-scale partially observable reinforcement learning environments that exhibit latent sub-task structure. This analysis underscores a key tradeoff: extending the context length allows the policy to model non-Markovian dependencies more effectively, but at the cost of potentially slower policy evaluation and, as a result, slower downstream learning. Moreover, our empirical results highlight the relevance of this analysis when leveraging Transformer-based neural networks. This perspective will become increasingly pertinent as the field scales towards larger and more realistic environments, opening up a number of potential future directions for improving the way we design learning agents.
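For reference, the "mixing time" invoked in the abstract is conventionally defined for the Markov chain induced by a policy; a minimal sketch of the standard total-variation definition is given below. The symbols $P_\pi$, $\mu_\pi$, and $\epsilon$ are illustrative assumptions and may differ from the paper's exact formalization for non-Markovian policies.

% Standard epsilon-mixing time of the Markov chain induced by a policy pi
% (illustrative sketch only; the paper's notion for long-context, non-Markovian
% policies may be defined over an augmented history or latent sub-task state).
% P_pi: state-transition kernel under pi; mu_pi: its stationary distribution.
$$
t_{\mathrm{mix}}(\epsilon) \;=\; \min\Big\{\, t \ge 1 \;:\; \max_{s}\,\big\lVert P_\pi^{\,t}(s,\cdot) - \mu_\pi \big\rVert_{\mathrm{TV}} \le \epsilon \,\Big\}
$$

Intuitively, slower mixing means longer trajectories are needed before empirical returns concentrate around the policy's average reward, which is why the abstract ties reliable policy evaluation (and hence downstream learning speed) to the mixing time.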
Primary Area: Reinforcement learning
Submission Number: 14006