Keywords: Meta-Reinforcement Learning, Exploration–Exploitation Tradeoff, Emergent Exploration, Transformers in RL, Pseudo-Thompson Sampling
TL;DR: This paper investigates when exploration can emerge naturally from greedy, reward-maximizing objectives in meta-reinforcement learning.
Abstract: Traditional reinforcement learning (RL) methods encourage exploration through added mechanisms such as action randomization, uncertainty bonuses, or intrinsic rewards. Interestingly, meta-reinforcement learning (meta-RL) agents can develop exploratory behavior even when trained with a purely greedy objective. This raises the question: under what conditions does greedy reward-seeking give rise to information-seeking behavior? We hypothesize that three ingredients are essential: (1) Recurring Environmental Structure, where environments generate repeatable patterns that can be exploited if discovered; (2) Agent Memory, which allows past interactions to inform future decisions; and (3) Long-Horizon Credit Assignment, which allows the delayed benefits of exploration to shape present decisions. Experiments in stochastic multi-armed bandits and temporally extended gridworlds demonstrate the need for recurrence, memory, and long-horizon credit assignment. In short-horizon settings, however, exploration can arise from a Pseudo-Thompson Sampling effect, which mimics posterior sampling and obscures the role of temporal credit. In contrast, long-horizon environments reveal that explicit Long-Horizon Credit Assignment substantially improves returns. Our results suggest that structure, memory, and long horizons are critical for greedy training to induce exploration, highlighting these factors as key design considerations for effective meta-RL agents.
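For context on the posterior-sampling baseline that the abstract's "Pseudo-Thompson Sampling effect" is said to mimic, the sketch below shows classical Thompson sampling on a stochastic Bernoulli bandit. This is an illustrative, hypothetical example, not the paper's method or code; the arm means, horizon, and Beta(1, 1) priors are assumptions chosen for clarity.

```python
import numpy as np

def thompson_sampling(true_means, n_steps=1000, seed=0):
    """Classical Thompson sampling on a Bernoulli multi-armed bandit.

    Illustrative sketch only: parameters are hypothetical, not drawn
    from the paper. Each step samples a plausible mean per arm from
    its Beta posterior and acts greedily on that sample, which is the
    behavior the paper's Pseudo-Thompson Sampling effect resembles.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.ones(n_arms)  # Beta posterior alpha (starts at 1)
    failures = np.ones(n_arms)   # Beta posterior beta (starts at 1)
    total_reward = 0.0
    for _ in range(n_steps):
        sampled_means = rng.beta(successes, failures)  # posterior sample per arm
        arm = int(np.argmax(sampled_means))            # greedy w.r.t. the sample
        reward = float(rng.random() < true_means[arm]) # Bernoulli payoff
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

if __name__ == "__main__":
    # Example: three arms with unknown success probabilities.
    print(thompson_sampling([0.2, 0.5, 0.8]))
```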
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 21801