When Should Reinforcement Learning Use Causal Reasoning?

Published: 09 May 2025, Last Modified: 09 May 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Reinforcement learning (RL) and causal reasoning naturally complement each other. The goal of causal reasoning is to predict the effects of interventions in an environment, while the goal of reinforcement learning is to select interventions that maximize the rewards the agent receives from the environment. Reinforcement learning includes the two most powerful sources of information for estimating causal relationships: temporal ordering and the ability to act on an environment. This paper provides a theoretical study examining which reinforcement learning settings we can expect to benefit from causal reasoning, and how. According to our analysis, the key factor is whether the behavioral policy (which generates the data) can be executed by the learning agent, meaning that the observation signal available to the learning agent comprises all observations used by the behavioral policy. Common RL settings with behavioral policies that are executable by the learning agent include on-policy learning and online exploration, where the learning agent uses a behavioral policy to explore the environment. Common RL settings with behavioral policies that are not executable by the learning agent include offline learning with a partially observable state space and asymmetric imitation learning where the demonstrator has access to more observations than the imitator. Using the theory of causal graphs, we show formally that when the behavioral policy is executable by the learning agent, conditional probabilities are causal, and can therefore be used to estimate expected rewards as done in traditional RL. However, when the behavioral policy is not executable by the learning agent, conditional probabilities may be confounded and provide misleading estimates of expected rewards. For confounded settings, we describe previous and new methods for leveraging causal reasoning.
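As a minimal, hypothetical illustration of the confounding described in the abstract (this sketch is not code from the paper): consider a one-step offline setting in which the behavioral policy conditions on a latent variable U that the learning agent never observes. All variable names and numeric values below are assumptions chosen only to make the gap between conditional and interventional reward estimates visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# U: a latent state observed by the behavioral policy but hidden from the learner.
u = rng.integers(0, 2, size=n)  # P(U=0) = P(U=1) = 0.5

# Behavioral policy (not executable by the learner): prefers action 1 when U = 1.
a = rng.binomial(1, np.where(u == 1, 0.9, 0.1))

# Assumed reward model: action 1 is better than action 0 in every state, with a
# true causal effect of E[R | do(A=1)] - E[R | do(A=0)] = 0.1.
r = 0.2 + 0.6 * u + 0.1 * a + rng.normal(0.0, 0.05, size=n)

# Conditional estimates from the logged data, as naive offline RL would compute them.
cond = {act: r[a == act].mean() for act in (0, 1)}

# Interventional values via back-door adjustment over the true U; the unweighted
# mean over strata equals sum_u P(u) E[R | u, a] because P(U=0) = P(U=1) = 0.5.
# This is only possible here because we simulated U; the confounded learner cannot do it.
interv = {act: np.mean([r[(u == s) & (a == act)].mean() for s in (0, 1)])
          for act in (0, 1)}

print("E[R | A=a]     :", cond)    # roughly {0: 0.26, 1: 0.84}: gap ~0.58, inflated by confounding
print("E[R | do(A=a)] :", interv)  # roughly {0: 0.50, 1: 0.60}: gap ~0.10, the true causal effect
```

If the learner could also observe U, as in the executable-policy settings the abstract lists (on-policy learning, online exploration), the conditional and interventional estimates would coincide; the simulation only illustrates how they can diverge when the behavioral policy uses observations the learner lacks.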
Certifications: Survey Certification
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We received excellent detailed suggestions before our last revision, but none between our last revision and the decision. The changes are therefore mainly edits to improve conciseness and clarity (e.g., making some paragraphs more concise). As directed by the action editor, we reviewed the visible comments from reviewer UboP again and made sure they were addressed (accessibility to non-causality theorists, sample size assumptions, reference [1]), including a footnote acknowledging the anonymous reviewer as the source of reference [1]. Reviewer UboP made special mention of the placement of floats. All floats are now at the top or bottom (i.e., we used [bt] as placement options, not [hbt]). The only exception is Table 4; placing it at the top disrupts an itemized list, so we let LaTeX place it with [hbp]. Our thanks again to the action editor and the reviewers for the very helpful work they have put into improving our paper.
Video: https://drive.google.com/file/d/1Jbfh5lf7SLREttEAXj_GuhCBd2btm49j/view?usp=sharing
Assigned Action Editor: ~Wilka_Torrico_Carvalho1
Submission Number: 3149
