\section{Introduction}\label{sec:_1}
A typical reinforcement learning agent learns from past data, i.e., from observed trajectories of states, actions, and reward signals generated by the agent intervening in the underlying environment. This data reflects the influence of the decision-making policy used to allocate actions based on the observed state, which is called the \emph{behavior policy}. This policy might have been selected by the agent in the past or by a different demonstrator operating in the same environment. \emph{Policy evaluation} studies the problem of evaluating the effectiveness of a candidate \emph{target policy} from the combination of past data and theoretical assumptions about the environment. When the behavior and target policies coincide, the evaluation is called \emph{on-policy} learning, in which the expected return of candidate policies given the agent's starting state (i.e., the value function) can be directly estimated with empirical means \citep{sutton1998reinforcement}. In practice, however, the learner might have to learn about policies different from the currently deployed one that generated the data, leading to the \emph{off-policy} learning problem.

Off-policy learning is an active area of research because it improves data efficiency by reusing data collected under different policies. Several algorithms have been proposed for off-policy evaluation from finite observations, including Q-learning \citep{watkins1989learning,watkins1992q}, importance sampling \citep{swaminathan2015counterfactual,jiang2015doubly}, and temporal-difference learning \citep{precup2000eligibility,munos2016safe}. These algorithms rely on two critical assumptions about the behavior policy. First, no unobserved confounder affects both the action selected by the behavior policy and the subsequent state and reward. Second, the behavior policy is stochastic and covers every action that the target policy may select in every observed state. When either of these assumptions does not hold, the effect of the target policy is generally not \emph{identifiable}, i.e., the model assumptions are insufficient to uniquely determine the value function from the offline data \citep{pearl:2k,zhang2019near}.
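
As an illustrative sketch of why both assumptions matter (the symbols $\mu$ and $\pi$ are assumed here for illustration only), consider a one-step problem with state $S$, action $A$, and reward $R$, a behavior policy $\mu(a \mid s)$, and a target policy $\pi(a \mid s)$. If no unobserved confounder is present and $\mu(a \mid s) > 0$ whenever $\pi(a \mid s) > 0$, the expected reward under $\pi$ can be written in terms of the observational distribution:
\[
E_{\pi}[R] = \sum_{s, a} P(s) \, \pi(a \mid s) \, E[R \mid s, a] = E_{\mu}\left[ \frac{\pi(A \mid S)}{\mu(A \mid S)} \, R \right].
\]
Without overlap, the ratio $\pi(a \mid s) / \mu(a \mid s)$ is undefined for some pairs $(s, a)$ that the target policy selects. With unobserved confounding, the observed conditional mean $E[R \mid s, a]$ need not equal the expected reward obtained by intervening to set $A = a$ in state $s$, so the first equality may fail.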

Recently, researchers have used partial identification methods to obtain reliable off-policy evaluation in settings where unobserved confounders exist and the behavior and target policies have no common support \citep{kallus2018confounding,zhang2019near,kallus2020confounding,namkoong2020off,khan2023off,bruns2023robust,kausik2024offline}. Partial identification is a well-studied problem in causal inference \citep{balke:pea97,zhang2022partial,zhang2021bounding}, econometrics \citep{imbens1997bayesian,poirier1998revising,romano2008inference,stoye2009more,bugni2010bootstrap,todem2010global,moon2012bayesian}, and dynamical systems \citep{bajari2007estimating,norets2014semiparametric,dickstein2018exporters,morales2019extended,berry2023instrumental}. It enables the derivation of informative bounds on target effects from confounded observational data. Several model-based algorithms have been proposed, which estimate the underlying system dynamics from offline data under a combination of conditions and constraints. These include (1) the marginal sensitivity model, which assumes access to a bound on the odds ratio between the nominal and actual behavior policies \citep{kallus2018confounding,kallus2020confounding,namkoong2020off,khan2023off,bruns2023robust}; (2) parametric knowledge about the system dynamics (i.e., the reward function and transition distribution), under which informative bounds are derived \citep{kausik2024offline}; and (3) a finite decision horizon, i.e., the agent determines only a finite number of actions \citep{kallus2018confounding,zhang2019near,namkoong2020off,khan2023off,kausik2024offline}. We refer readers to the complete technical report \citep[Appendix A]{zhang2024eligibility} for a more detailed survey.

This paper contributes to this growing line of literature by studying model-free algorithms for robust off-policy evaluation over an infinite horizon from confounded offline data generated by a behavior policy without overlapping support. We propose novel partial identification algorithms that use eligibility traces to obtain informative bounds on the expected return of candidate policies from offline data generated by an unknown Markov decision process in which unobserved confounders exist and overlap does not hold.

More specifically, our contributions are summarized as follows. (1) We extend the Bellman equation so that one can derive optimal bounds over target value functions from the observational distribution generated by an unknown behavior policy. (2) We propose a novel off-policy temporal-difference algorithm (\texttt{C-TD($\lambda$)}) that uses eligibility traces to estimate bounds over the state value function from finite observations contaminated by unobserved confounding and lacking overlap. (3) We introduce an alternative eligibility trace algorithm following tree backup (\texttt{C-TB($\lambda$)}) that obtains bounds over the state-action value function from biased observations. Finally, we evaluate our proposed algorithms using extensive simulations in synthetic environments. All proofs and details of the experiment setup are provided in the technical report \citep{zhang2024eligibility}.

\paragraph{Notations.} We use capital letters to denote random variables ($X$), small letters for their values ($x$), and $\1X$ for the domain of $X$. For an arbitrary set $\*X$, let $|\*X|$ be its cardinality. Fix indices $i, j \in \3N$. Let $\bar{\*X}_{i:j}$ denote the sequence $\{X_i, X_{i+1}, \dots, X_j\}$. We denote by $P(\*X)$ a probability distribution over variables $\*X$. Similarly, $P(\*Y \mid \*X)$ represents the set of conditional distributions $P(\*Y \mid \*X = \*x)$ for all realizations $\*x$. We consistently use $P(\*x)$ as an abbreviation of the probability $P(\*X = \*x)$; likewise, $P(\*y \mid \*x)$ abbreviates $P(\*Y = \*y \mid \*X = \*x)$. Finally, $\I_{\*Z = \*z}$ is an indicator function that returns $1$ if the event $\*Z = \*z$ holds and $0$ otherwise.

%An SCM $M$ is a tuple $\tuple{\*V, \*U, \1F, P(\*U)}$, where $\*V$ is a set of endogenous variables and $\*U$ is a set of exogenous variables  \citep{pearl:2k, bareinboim2022pearl}. $\1F$ is a set of functions s.t. each $f_V \in \1F$ decides values of an endogenous variable $V \in \*V$ taking as argument a combination of other variables in the system. That is, $V \leftarrow f_{V}(\*\PA_V, \*U_V), \*\PA_V \subseteq \*V, \*U_V \subseteq \*U$. Values of exogenous variables $U \in \*U$ are drawn from the exogenous distribution $P(\*U)$. Naturally, $M$ induces an \emph{observational distribution} $P(\*V)$. An intervention on a subset $\*X \subseteq \*V$, denoted by $\doo(\*x)$, is an operation where values of $\*X$ are set to constants $\*x$, replacing the functions $\{f_{X}: \forall X \in \*X\}$ that would normally determine their values. For an SCM $M$, let $M_{\*x}$ be a submodel of $M$ induced by intervention $\doo(\*x)$. For a set $\*Y \subseteq \*V$, the \emph{interventional distribution} $\inv{\*Y}{\*x}$ induced by $\doo(\*x)$ is defined as the joint distribution over $\*Y$ in the submodel $M_{\*x}$, i.e., $\inv{\*Y; M}{\*x} \triangleq P \Parens{\*Y;M_{\*x}}$.

%The basic semantical framework of our analysis rests on \textit{structural causal models} (SCMs) \citep{pearl:2k, bareinboim2022pearl}. An SCM $M$ is a tuple $\tuple{\*V, \*U, \1F, P(\*U)}$, where $\*V$ is a set of endogenous variables and $\*U$ is a set of exogenous variables. $\1F$ is a set of functions s.t. each $f_V \in \1F$ decides values of an endogenous variable $V \in \*V$ taking as argument a combination of other variables in the system. That is, $V \leftarrow f_{V}(\*\PA_V, \*U_V), \*\PA_V \subseteq \*V, \*U_V \subseteq \*U$. Exogenous variables $U \in \*U$ are mutually independent, values of which are drawn from the exogenous distribution $P(\*U)$. Naturally, $M$ induces a joint distribution $P(\*V)$ over endogenous $\*V$, called the \emph{observational distribution}. 

%An intervention on a subset $\*X \subseteq \*V$, denoted by $\doo(\*x)$, is an operation where values of $\*X$ are set to constants $\*x$, replacing the functions $\{f_{X}: \forall X \in \*X\}$ that would normally determine their values. For an SCM $M$, let $M_{\*x}$ be a submodel of $M$ induced by intervention $\doo(\*x)$. For a set $\*Y \subseteq \*V$, the interventional distribution $\inv{\*Y}{\*x}$ induced by $\doo(\*x)$ is defined as the joint distribution over $\*Y$ in the submodel $M_{\*x}$, i.e., $\inv{\*Y; M}{\*x} \triangleq P \Parens{\*Y;M_{\*x}}$. We leave $M$ implicit when it is obvious from the context. For a detailed survey on SCMs, we refer readers to \cite[Ch.~7]{pearl:2k}.