Keywords: Reinforcement Learning, Causal Inference, POMDP, Latent Confounding
Abstract: In many Reinforcement Learning (RL) applications, offline data is readily available before an algorithm is deployed. Often, however, the data-collection policies had access to information that is not recorded in the dataset, requiring the RL agent to take unobserved confounders into account. We focus on the setting where the confounders are i.i.d. and, without additional assumptions on the strength of the confounding, we derive bounds on the causal effects of the actions on the observations and rewards. In particular, we show that these bounds are tight when multiple datasets collected from diverse behavioral policies are leveraged. We incorporate these bounds into Posterior Sampling for Reinforcement Learning (PSRL) and demonstrate their efficacy experimentally.
Submission Number: 16