\section{Conclusion}
This paper investigates off-policy evaluation in Markov Decision Processes from offline data collected by a different behavior policy, where unobserved confounding and no-overlap cannot be ruled out \emph{a priori}. These conditions lead to violations of causal consistency (\Cref{def:_2_consist}), which could pose significant challenges to standard off-policy algorithms. We first extend the celebrated Bellman equation to derive informative bounds on value functions from the observational data; these bounds are robust to bias due to unobserved confounding and no-overlap. Based on these extended equations, we propose two novel model-free off-policy algorithms using eligibility traces: one based on standard temporal difference learning (\texttt{C-TD($\lambda$)}), and the other based on tree backup (\texttt{C-TB($\lambda$)}). These algorithms permit us to bound value functions from finite observations. Our simulation results show that standard off-policy RL algorithms fail to recover the true value functions of a target policy when unobserved confounding and no-overlap are present. In contrast, our proposed algorithms yield robust evaluations of the target value functions from imperfect offline observations.

\section*{Acknowledgements}
This research was supported in part by the NSF, ONR, AFOSR, DoE, Amazon, JP Morgan, and the Alfred P. Sloan Foundation.