Distributional Off-policy Evaluation with Bellman Residual Minimization

Published: 22 Jan 2025, Last Modified: 03 Oct 2025 | AISTATS 2025 Poster | CC BY 4.0
TL;DR: We give a theoretical analysis of distributional off-policy evaluation in the infinite-horizon setting, based on expectation-extended distances.
Abstract: We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return for a target policy using offline data generated by a different policy. The theoretical foundation of many existing methods relies on supremum-extended statistical distances, such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification for their validity in learning the return distribution. Building on this property, we propose a new method for distributional OPE called Energy Bellman Residual Minimizer (EBRM) and analyze it in depth. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension, which improves the error bound in non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
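The contrast between supremum-extended and expectation-extended distances can be illustrated with a small sketch. It is not the EBRM estimator; it only assumes the standard energy distance between two one-dimensional samples, 2 E|X - Y| - E|X - X'| - E|Y - Y'|, estimated by plugging in empirical averages. The function names, the toy state keys, and the weights are hypothetical, and the abstract does not state which normalization of the energy distance the paper uses.

```python
from itertools import product

def mean_abs_diff(xs, ys):
    """Average |x - y| over all pairs: a plug-in estimate of E|X - Y|."""
    return sum(abs(x - y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def energy_distance(xs, ys):
    """Plug-in estimate of 2 E|X-Y| - E|X-X'| - E|Y-Y'| from two samples."""
    return 2 * mean_abs_diff(xs, ys) - mean_abs_diff(xs, xs) - mean_abs_diff(ys, ys)

def expectation_extended(p, q, weights):
    """Average the per-key distance under a weighting over keys
    (keys stand in for state-action pairs; weights are assumed to sum to 1)."""
    return sum(w * energy_distance(p[k], q[k]) for k, w in weights.items())

def supremum_extended(p, q):
    """Take the worst-case per-key distance over all keys."""
    return max(energy_distance(p[k], q[k]) for k in p)

# Toy return samples for two hypothetical state-action pairs.
p = {"s0": [0.0, 1.0], "s1": [2.0, 2.0]}
q = {"s0": [0.0, 1.0], "s1": [2.0, 4.0]}
weights = {"s0": 0.9, "s1": 0.1}  # hypothetical data distribution

avg_dist = expectation_extended(p, q, weights)
sup_dist = supremum_extended(p, q)
```

In this toy case the two collections differ only at the rarely visited key `s1`, so the weighted average is small while the supremum is driven entirely by that key. The average can be estimated from samples drawn under the weighting, whereas the supremum requires accurate estimates at every key.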
Full Paper: https://proceedings.mlr.press/v258/hong25c.html
Submission Number: 1513