Model-Based Exploration in Monitored Markov Decision Processes

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: This paper introduces a model-based algorithm for sequential decision-making in the face of partially observable rewards.
Abstract: A tenet of reinforcement learning is that the agent always observes rewards. However, this is not true in many realistic settings, e.g., a human observer may not always be available to provide rewards, sensors may be limited or malfunctioning, or rewards may be inaccessible during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed to model such settings. However, existing Mon-MDP algorithms have several limitations: they do not fully exploit the problem structure, cannot leverage a known monitor, lack worst-case guarantees for "unsolvable" Mon-MDPs without specific initialization, and offer only asymptotic convergence proofs. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that observable rewards are reliably captured, and another to learn the minimax-optimal policy. Second, we empirically demonstrate its advantages: faster convergence than prior algorithms on more than four dozen benchmarks, and even greater improvements when the monitoring process is known. Third, we present the first finite-sample bound on performance, showing convergence to a minimax-optimal policy even when some rewards are never observable.
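To illustrate the kind of mechanism the abstract describes, the following is a minimal sketch (not the authors' implementation) of model-based interval-estimation-style optimism on a tabular model, paired with a pessimistic value that assumes the worst-case reward for state-action pairs whose rewards have never been observed. All names (optimistic_pessimistic_q, n_sa, r_sum, reward_seen, r_min), the bonus form, and the value-iteration loop are illustrative assumptions, not taken from the paper or its repository.

import numpy as np

def optimistic_pessimistic_q(P, n_sa, r_sum, reward_seen,
                             r_min=-1.0, gamma=0.95, beta=1.0, iters=500):
    """Illustrative sketch only.
    P           : estimated transition probabilities, shape (S, A, S)
    n_sa        : visit counts, shape (S, A)
    r_sum       : sum of rewards actually observed, shape (S, A)
    reward_seen : True where a reward was ever observed, shape (S, A)
    Returns an optimistic Q (to drive exploration toward observable rewards)
    and a pessimistic Q (minimax with respect to never-observed rewards)."""
    S, A, _ = P.shape
    # Empirical mean reward where one was observed; worst-case value r_min elsewhere.
    r_hat = np.where(reward_seen, r_sum / np.maximum(n_sa, 1), r_min)
    # Count-based exploration bonus that shrinks as (s, a) is visited more often.
    bonus = beta / np.sqrt(np.maximum(n_sa, 1))

    q_opt = np.zeros((S, A))   # optimistic: explore until observable rewards are captured
    q_pes = np.zeros((S, A))   # pessimistic: act minimax-optimally on unseen rewards
    for _ in range(iters):
        q_opt = r_hat + bonus + gamma * (P @ q_opt.max(axis=1))
        q_pes = r_hat + gamma * (P @ q_pes.max(axis=1))
    return q_opt, q_pes

In this sketch an agent would act greedily with respect to q_opt while rewards still appear learnable, and fall back to the greedy policy under q_pes for behaviour whose rewards can never be observed; the actual algorithm, its confidence intervals, and its guarantees are specified in the paper and the linked repository.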
Lay Summary: Learning from trial and error is the most natural form of learning in humans: we learn from our mistakes and successes even when the right course of action was not clear from the outset. This idea is at the core of artificial intelligence (AI). Artificial agents should be able to try different possibilities and learn from the outcomes, as an alternative to being explicitly shown beforehand how the task at hand should be accomplished. However, in many real-world scenarios agents do not always have access to feedback to learn from; imagine, for example, a human supervisor who leaves because of time constraints. This possibility has largely been overlooked when developing AI methods for sequential decision-making. Our solution to this challenge is to drive cautious behaviour when encountering unknowns. Our algorithm uses a classical approach to incentivize the agent to explore the world and try different possibilities in order to learn. But if the agent recognizes that some behaviour never yields feedback, it becomes cautious about that behaviour due to the lack of knowledge. This caution provides a worst-case guarantee: the agent preemptively avoids the most catastrophic outcomes. At the same time, if the agent recognizes that everything can be learned, it never becomes prematurely cautious and keeps trying different actions and possibilities to learn as much as needed from them.
Link To Code: https://github.com/IRLL/Exploration-in-Mon-MDPs
Primary Area: Reinforcement Learning
Keywords: Exploration-Exploitation, Model-Based Interval Estimation, Monitored Markov Decision Processes
Submission Number: 12205