Causality in Goal Conditioned RL: Return to No Future?

Published: 03 Nov 2023, Last Modified: 27 Nov 2023, GCRL Workshop
Confirmation: I have read and confirm that at least one author will be attending the workshop in person if the submission is accepted
Keywords: causal inference, structural equation models, goal-conditioned RL, goal-conditioned RL with supervised learning
Abstract: The main goal of goal-conditioned RL (GCRL) is to learn actions that maximize the conditional probability of reaching the desired goal from the current state. To improve sample efficiency, GCRL relies on either 1) imitation learning from expert demonstrations or 2) supervised learning with self-imitation, here denoted goal-conditioned RL with supervised learning (GCRL-SL). GCRL-SL algorithms directly estimate the probability of actions ($A=a$) given the current state ($S=s$) and a future observed goal ($G=g$) from batch data generated under a behavior policy; the chosen action then maximizes the estimate of $P(A \mid S=s, G=g)$. One crucial insight missing from both empirical and theoretical work on GCRL is the causal interpretation of the policy learned by GCRL algorithms. In this study, we begin exploring a question central to safe and robust decision-making: which causal biases arise in the GCRL training process, and when can they lead to a poor policy? Our theoretical and empirical analysis demonstrates that GCRL algorithms can learn poor policies when the training data is generated under particular causal graphs. This issue is especially problematic when deploying GCRL in environments with potential unmeasured confounding, as often encountered in healthcare and mobile health applications.
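To make the confounding concern concrete, the following is a minimal illustrative simulation (not from the paper; the causal graph and all probabilities are invented for exposition). An unmeasured confounder $U$ drives both the behavior policy's action and goal achievement, while the action itself has a small negative causal effect on the goal. The observational quantity $P(A=1 \mid G=1)$ that a GCRL-SL estimator would fit then favors the causally worse action:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Unmeasured confounder U: drives both the behavior policy's action
# choice and the probability of reaching the goal.
u = rng.random(n) < 0.5
# Behavior policy mostly follows U (A = U with probability 0.9).
a = np.where(rng.random(n) < 0.9, u, ~u)
# Goal achievement: U helps a lot, but A=1 is slightly HARMFUL
# (the causal effect of A on G is -0.05 for every value of U).
p_goal = np.where(u, 0.9, 0.1) - 0.05 * a
g = rng.random(n) < p_goal

# GCRL-SL-style observational estimate: P(A=1 | G=1).
# Analytically this is about 0.81, so self-imitation prefers A=1.
p_a1_given_g1 = a[g].mean()

# Interventional success rates, computable here because we simulate do(A=a'):
# E[G | do(A=1)] ~= 0.45 < E[G | do(A=0)] ~= 0.50, so A=0 is causally better.
succ_do_a1 = (np.where(u, 0.9, 0.1) - 0.05).mean()
succ_do_a0 = np.where(u, 0.9, 0.1).mean()

print(f"P(A=1 | G=1) from batch data: {p_a1_given_g1:.2f}")
print(f"E[G | do(A=1)] = {succ_do_a1:.2f}, E[G | do(A=0)] = {succ_do_a0:.2f}")
```

Under this toy graph, the conditional estimate recommends $A=1$ even though intervening with $A=0$ achieves the goal more often, matching the abstract's claim that particular causal structures in the training data can yield poor GCRL-SL policies.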
Supplementary Material: pdf
Submission Number: 28