Noise as a Natural Regularizer in Markov Decision Processes: Connecting Environmental Stochasticity and Policy Simplicity
TL;DR: Transition noise decreases the effective discount factor for MDPs
Abstract: The planning horizon in a Markov Decision Process (MDP) determines how far into the future an agent reasons. In practice, shorter horizons are commonly associated with policies that exhibit simpler or more interpretable decision-making behavior. In this paper, we establish a formal connection between environmental stochasticity and planning horizon in MDPs. We show that for broad classes of transition noise, solving a noisy MDP can be formally related to solving a noise-free MDP with a smaller effective discount factor, leading to identical optimal policies in some cases and near-optimal ones in others. We further characterize settings in which this correspondence breaks down, clarifying when horizon-based interpretations of noise are not valid. These results, which are supported by both theory and experiments, also give some insight into the common practice of using smaller discount factors for reinforcement learning than those that can be justified by standard modeling interpretations.
Lay Summary: Optimal behavior often requires balancing present and future rewards with a discount applied to future rewards. Agents that place a high value on future rewards will need more complicated plans to achieve these rewards, while agents that discount the future more heavily can make shallower or simpler plans for the future. In this paper, we establish a connection between randomness in the outcome of actions and how much an agent can discount future rewards. We show that various kinds of randomness can effectively decrease how much an agent values future rewards by extending the time it takes for the agent to carry out its plans, creating a direct relationship between randomness and future discounting. We support this with theoretical results and concrete experiments, and we additionally characterize the situations in which randomness does not necessarily decrease the value of future rewards.
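Illustrative sketch (not taken from the paper or its code): one simple noise model for which the correspondence is exact is "sticky" noise, where each action fails with probability eps and leaves the agent in its current state while the step reward is still collected. This noise model, and every name below (`effective_discount`, `value_iteration`, `compare`), are assumptions made for illustration only. Under that assumption, rearranging the Bellman optimality equation gives V(s) = max_a [ r(s,a) / (1 - gamma*eps) + gamma' V(s') ] with gamma' = gamma (1 - eps) / (1 - gamma*eps) < gamma, so the noisy MDP should share its optimal values and greedy policy with a noise-free MDP that uses rescaled rewards and the smaller discount gamma'. The pure-Python check below compares the two on a small random deterministic MDP.

```python
import random


def effective_discount(gamma, eps):
    """Discount of the equivalent noise-free MDP under the assumed sticky noise."""
    return gamma * (1.0 - eps) / (1.0 - gamma * eps)


def value_iteration(P, R, gamma, tol=1e-12, max_iter=100_000):
    """P[s][a] is a list of (prob, next_state); R[s][a] is the step reward."""
    n = len(P)

    def q_values(V):
        return [[R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                 for a in range(len(P[s]))] for s in range(n)]

    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [max(q) for q in q_values(V)]
        delta = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = [max(range(len(q)), key=q.__getitem__) for q in q_values(V)]
    return V, policy


def compare(n_states=6, n_actions=3, gamma=0.95, eps=0.3, seed=0):
    """Solve a noisy MDP and its assumed noise-free equivalent; return both."""
    rng = random.Random(seed)
    # Hypothetical random deterministic MDP: nxt[s][a] is the intended next state.
    nxt = [[rng.randrange(n_states) for _ in range(n_actions)]
           for _ in range(n_states)]
    R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]

    # Noisy MDP: with probability eps the action fails and the agent stays put.
    P_noisy = [[[(eps, s), (1.0 - eps, nxt[s][a])] for a in range(n_actions)]
               for s in range(n_states)]
    # Noise-free MDP: same intended transitions, rewards scaled by 1/(1 - gamma*eps).
    P_clean = [[[(1.0, nxt[s][a])] for a in range(n_actions)]
               for s in range(n_states)]
    scale = 1.0 / (1.0 - gamma * eps)
    R_scaled = [[scale * r for r in row] for row in R]

    V_noisy, pi_noisy = value_iteration(P_noisy, R, gamma)
    V_equiv, pi_equiv = value_iteration(P_clean, R_scaled,
                                        effective_discount(gamma, eps))
    return V_noisy, V_equiv, pi_noisy, pi_equiv
```

Calling `compare()` returns the optimal values and greedy policies of both problems, which can be compared entry by entry. Other noise models, such as uniformly random action replacement, need not admit an exact reduction of this kind, which is consistent with the abstract's distinction between identical and merely near-optimal policies.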
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/Extile1/MDP_Noise_Regularization
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Markov Decision Process, Noise, Interpretability, Discount Factor, Planning Horizon
Originally Submitted PDF: pdf
Submission Number: 17630