Keywords: Markov Decision Process, Reinforcement Learning, Theory
TL;DR: Learning in environments with dynamically revealed limits is a hunting game where novelty and eluder dimension determine whether the agent or environment wins.
Abstract: Markov Decision Processes (MDPs) address sequential decision-making under stochastic dynamics, where an agent selects actions, observes transitions, and aims to maximize rewards. Traditional reinforcement learning (RL) approaches assume a reasonably accurate estimate of the operating region in the state space. However, this assumption rarely holds in real-world domains such as counter-drone defense and algorithmic trading, where the environment's limits of operation are revealed only gradually through interaction. As a result, the stochastic dynamics may push the agent into unfamiliar regions, where incomplete knowledge leads to suboptimal actions and reduced reward accumulation. This paper formulates this new phenomenon as a hunting game between the agent (hunter) and the environment (target). The key motivation behind this formulation is that environments with heavy-tailed variability introduce rare but impactful surprises that slow down learning and act as implicit defenses, even without an explicit adversary. Despite its practical relevance, this setting remains poorly understood. We analyze the theoretical limits of such hunting games in a model-based RL framework. Our analysis reveals that the difficulty of learning is governed by the novelty encountered by the agent, weighted by the eluder dimension of the environment's true model class. Reducing either factor shifts the balance in favor of the agent.
Primary Area: reinforcement learning
Submission Number: 22198