Keywords: reinforcement learning, healthcare, medical decision making, sparse rewards, value function, policy learning, reward design, probabilistic interpretation
TL;DR: We prove an equivalence relationship among three sparse reward designs commonly used in healthcare RL and empirically probe the assumptions it requires.
Abstract: In reinforcement learning (RL) for healthcare, reward functions often encode clinical endpoints such as survival and death. This results in a sparse reward structure with non-zero rewards only at terminal transitions. However, the exact numerical rewards assigned to survival and death vary across the existing literature, raising the question of whether these different designs optimize the same objective. In this work, we theoretically and empirically examine three common sparse reward designs: survival-only, death-only, and mixed. We prove that, under the assumptions of terminal-only rewards, guaranteed absorption, and no discounting, the value functions of the three designs satisfy an equivalence relationship and induce the same optimal policy. We verify these theoretical results in randomly generated MDPs and demonstrate how relaxing each assumption affects the equivalence relationship. Finally, in a more complex grid-world domain where the assumptions are violated, we find that the survival-only and mixed designs consistently lead to better policies than the death-only design. Our findings provide important initial insights into the choice of sparse reward design and how it shapes policy learning in healthcare RL applications.
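The claimed equivalence can be checked numerically. Below is a minimal sketch (not the paper's code; the MDP construction, the +1/-1 endpoint reward values, and all function names are illustrative assumptions) that runs undiscounted value iteration on a randomly generated absorbing MDP under the three reward designs and confirms that the resulting greedy policies coincide.

```python
# Minimal sketch (illustrative, not the paper's code): under terminal-only
# rewards, guaranteed absorption, and gamma = 1, the survival-only,
# death-only, and mixed designs should yield the same optimal policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 3                 # transient states and actions (arbitrary sizes)
SURVIVE, DEATH = n_states, n_states + 1    # two absorbing terminal states

# Random transition kernel over transient + terminal states; boosting the
# terminal columns guarantees absorption from every (state, action) pair.
P = rng.random((n_states, n_actions, n_states + 2))
P[..., SURVIVE:] += 0.5
P /= P.sum(axis=-1, keepdims=True)

def greedy_policy(r_survive, r_death, iters=1000):
    """Undiscounted value iteration with rewards only on terminal transitions."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = (P[..., :n_states] @ V
             + P[..., SURVIVE] * r_survive
             + P[..., DEATH] * r_death)
        V = Q.max(axis=-1)
    return Q.argmax(axis=-1)

pi_survival = greedy_policy(+1.0,  0.0)   # survival-only design
pi_death    = greedy_policy( 0.0, -1.0)   # death-only design
pi_mixed    = greedy_policy(+1.0, -1.0)   # mixed design
assert (pi_survival == pi_death).all() and (pi_survival == pi_mixed).all()
print("All three designs recover the same greedy policy:", pi_survival)
```

With the +1/-1 rewards assumed above, each value function reduces to a function of the survival probability, so V_death = V_survival - 1 and V_mixed = 2 V_survival - 1; these are positive affine transformations, which preserve the greedy argmax and hence the optimal policy.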
Submission Number: 10