An Adaptation of RLSVI with Explicit Action Sampling Probabilities

Published: 01 Jun 2024, Last Modified: 07 Aug 2024
Venue: Deployable RL @ RLC 2024
License: CC BY 4.0
Keywords: Deployable RL; Posterior Sampling
TL;DR: We adapt RLSVI to provide explicit action sampling probabilities for after-study policy evaluation
Abstract: In real-world Reinforcement Learning (RL) deployments, the deployed online RL algorithm often needs to collect datasets that enable offline policy evaluation for any target policy. Many offline policy evaluation approaches use the Action Sampling Probabilities (ASPs), the conditional probabilities with which the implemented RL algorithm selected a particular action given all previously observed states, actions, and rewards. In the motivating digital health clinical trial, we originally planned to use the online Randomized Least Squares Value Iteration (RLSVI) algorithm for its robust empirical performance in such settings. However, RLSVI only has implicit ASPs, as it relies on external sources of randomness for exploration. To harness RLSVI's effective exploration while providing explicit ASPs, we propose to approximate the implicit ASPs of RLSVI and to sample actions directly from these approximations during online learning. Computing the implicit ASPs is an exact Bayesian computation problem, which we address through Monte Carlo integration with importance sampling. We call this method RLSVI-IS (Importance Sampling). We evaluate RLSVI-IS on a simulation testbed built for the mobile health clinical trial. Our results demonstrate that RLSVI-IS not only achieves cumulative rewards comparable to those of RLSVI but also provides explicit ASPs. Moreover, we propose a sufficient condition that enables rigorous control over the distance between the explicit ASPs of RLSVI-IS and the implicit ASPs of RLSVI.
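To make the idea concrete, the sketch below illustrates how implicit ASPs of a posterior-sampling algorithm like RLSVI can be approximated by Monte Carlo: since RLSVI acts greedily with respect to Q-function weights drawn from a Gaussian posterior, the probability of selecting an action is the probability that this action attains the argmax over sampled weights. This is a minimal, plain Monte Carlo sketch (it omits the importance sampling refinement described in the abstract), and all names (`estimate_asps`, the feature matrix `phi`, the posterior `mean`/`cov`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def estimate_asps(phi, mean, cov, n_samples=10_000, rng=None):
    """Monte Carlo estimate of implicit action sampling probabilities.

    phi:  (n_actions, d) state-action features for the current decision point.
    mean: (d,) posterior mean of the linear Q-function weights.
    cov:  (d, d) posterior covariance of the weights.

    A posterior-sampling agent acts greedily w.r.t. a weight vector drawn
    from this posterior, so P(action a) = P(argmax_a phi_a^T w = a).
    We approximate it by drawing many weight vectors and counting argmax wins.
    """
    rng = np.random.default_rng(rng)
    w = rng.multivariate_normal(mean, cov, size=n_samples)  # (n_samples, d)
    q = w @ phi.T                                           # (n_samples, n_actions)
    winners = q.argmax(axis=1)
    counts = np.bincount(winners, minlength=phi.shape[0])
    return counts / n_samples


# Example: a binary-action decision point with hypothetical posterior values.
phi = np.array([[1.0, 0.0], [0.0, 1.0]])   # features of actions 0 and 1
mean = np.array([0.2, 0.5])                # posterior mean of weights
cov = 0.1 * np.eye(2)                      # posterior covariance
asps = estimate_asps(phi, mean, cov, rng=0)
action = np.random.default_rng(0).choice(len(asps), p=asps)  # act from explicit ASPs
print(asps, action)
```

Sampling the action directly from the estimated probabilities (rather than from a fresh posterior draw) is what makes the ASPs explicit and loggable for after-study policy evaluation.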
Submission Number: 15