Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes

HyunJi Nam; Scott L Fleming; Emma Brunskill

Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes

HyunJi Nam, Scott L Fleming, Emma Brunskill

Published: 09 Nov 2021, Last Modified: 05 May 2023NeurIPS 2021 PosterReaders: Everyone

Keywords: Reinforcement Learning, Observation Cost, Markov Decision Process, MDP, Partially Observable Markov Decision Process, POMDP, Probably Approximately Correct, PAC, Healthcare, Health care

TL;DR: We provide theory and algorithms for efficient reinforcement learning in a special class of POMDPs exhibiting full observability (albeit at a cost) contingent on an agent's actions

Abstract: Many real-world problems that require making optimal sequences of decisions under uncertainty involve costs when the agent wishes to obtain information about its environment. We design and analyze algorithms for reinforcement learning (RL) in Action-Contingent Noiselessly Observable MDPs (ACNO-MDPs), a special class of POMDPs in which the agent can choose to either (1) fully observe the state at a cost and then act; or (2) act without any immediate observation information, relying on past observations to infer the underlying state. ACNO-MDPs arise frequently in important real-world application domains like healthcare, in which clinicians must balance the value of information gleaned from medical tests (e.g., blood-based biomarkers) with the costs of gathering that information (e.g., the costs of labor and materials required to administer such tests). We develop a PAC RL algorithm for tabular ACNO-MDPs that provides substantially tighter bounds, compared to generic POMDP-RL algorithms, on the total number of episodes exhibiting worse than near-optimal performance. For continuous-state ACNO-MDPs, we propose a novel method of incorporating observation information that, when coupled with modern RL algorithms, yields significantly faster learning compared to other POMDP-RL algorithms in several simulated environments.

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: pdf

Code: https://github.com/nam630/acno_mdp

10 Replies

Loading