Using Confounded Data in Latent Model-Based Reinforcement Learning
Abstract: In the presence of confounding, naively using off-the-shelf offline reinforcement learning (RL) algorithms leads to sub-optimal behaviour. In this work, we propose a safe method to exploit confounded offline data in model-based RL, which improves the sample-efficiency of an interactive agent that also collects online, unconfounded data. First, we import ideas from the well-established framework of $do$-calculus to express model-based RL as a causal inference problem, thus bridging the gap between the fields of RL and causality. Then, we propose a generic method for learning a causal transition model from offline and online data, which captures and corrects the confounding effect using a hidden latent variable. We prove that our method is correct and efficient, in the sense that it attains better generalization guarantees thanks to the confounded offline data (in the asymptotic case), regardless of the confounding effect (the offline expert's behaviour). We showcase our method on a series of synthetic experiments, which demonstrate that (a) using confounded offline data naively degrades the sample-efficiency of an RL agent, while (b) using confounded offline data correctly improves sample-efficiency.
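The bias described in the abstract can be illustrated with a toy confounded bandit (an illustrative assumption, not the paper's actual setup): a hidden context drives both the offline expert's action and the reward, so the naive observational estimate of an action's value differs from its interventional value, which a backdoor-style adjustment over the confounder recovers.

```python
import random

random.seed(0)

# Toy confounded bandit (illustrative assumption, not the paper's setup):
# a hidden context u ~ Bernoulli(0.5) drives both the offline expert's
# action and the reward r = 1[a == u].
N = 100_000
data = []
for _ in range(N):
    u = random.random() < 0.5
    # The offline expert observes u and matches it 90% of the time,
    # which confounds action and reward in the logged data.
    a = u if random.random() < 0.9 else (not u)
    r = int(a == u)
    data.append((int(u), int(a), r))

# Naive observational estimate E[r | a=0]: biased by confounding (~0.9).
rs_a0 = [r for u, a, r in data if a == 0]
naive = sum(rs_a0) / len(rs_a0)

# Backdoor adjustment E[r | do(a=0)] = sum_u P(u) E[r | a=0, u].
# Here u is recorded purely for illustration; the paper instead infers
# a hidden latent variable from offline and online data.
adjusted = 0.0
for u_val in (0, 1):
    p_u = sum(1 for u, _, _ in data if u == u_val) / N
    rs = [r for u, a, r in data if u == u_val and a == 0]
    adjusted += p_u * (sum(rs) / len(rs) if rs else 0.0)
# adjusted is close to 0.5, the true interventional value
```

The gap between the naive estimate (about 0.9) and the adjusted one (about 0.5) is exactly the confounding effect that an off-the-shelf offline RL algorithm would absorb into its model.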
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Supplementary Material: pdf
Changes Since Last Submission: The major changes are listed below:
- Limitations of the theoretical results (reviewers ccoi and iFKF): both reviewers pointed to limitations in our theoretical results, chiefly that the upper bound in Theorem 1 is not tight and collapses to 1 as $T \to \infty$, and that our results are only asymptotic and provide no guarantee in the finite-sample regime. We have added a new Section 4.5, "Limitations of the provided guarantees", which acknowledges and discusses these points.
- Proof of Theorem 1 is hard to read (reviewers ccoi and 9ush): we have considerably reworked and simplified the proof to make it easier to follow.
- The paper is restricted to latent model-based RL (reviewers iFKF and 9ush): we have renamed the paper "Using Confounded Data in Latent Model-Based Reinforcement Learning" and rewritten part of the introduction to make it clearer that the scope of the paper is restricted to model-based RL. We also now mention the possibility of extending our method to latent-free model-based RL in the Discussion section.
- No mention of previous works in the introduction (reviewer 9ush): we now briefly mention previous works in the first paragraph of the introduction and point readers directly to Section 6 for a discussion of related works.

Minor changes requested by the AC:
- We have replaced our formal results (Proposition 1 and Corollary 1) with a less formal analysis of the theoretical properties of our approach in the asymptotic regime.
Assigned Action Editor: ~Martha_White1
Submission Number: 962