From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation

TMLR Paper6353 Authors

01 Nov 2025 (modified: 01 Jan 2026) · Under review for TMLR · CC BY 4.0
Abstract: We study off-policy evaluation in the contextual bandit setting, where the goal is to evaluate a new policy using historical data consisting of contexts, actions, and received rewards. This historical data typically does not accurately represent the action distribution of the new policy. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions. However, this method often suffers from high variance because the action probability appears in the denominator. The doubly robust (DR) estimator reduces variance by modeling the reward but does not directly address the variance arising from IPW. In this work, we address this limitation of IPW by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Our NW approach achieves low bias like IPW but typically exhibits significantly lower variance. To further reduce variance, we incorporate reward predictions -- similar to the DR technique -- resulting in the Model-assisted Nonparametric Weighting (MNW) approach. The MNW approach yields accurate value estimates by explicitly modeling and mitigating the bias introduced by reward modeling, without aiming to guarantee the standard doubly robust property. Extensive empirical comparisons show that our approaches consistently outperform existing techniques, achieving lower variance in value estimation while maintaining low bias.
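For reference, writing $a_i$ for the logged action of unit $i$, $p_{ia_i}$ for its probability under the logging policy, $\pi_{ia_i}$ for its probability under the target policy, and $r_{ia_i}$ for the observed reward (notation consistent with the revision summary below; the paper's exact notation may differ), the standard IPW estimate of the policy value over $n$ logged units is
\begin{equation}
\hat{V}_{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_{ia_i}}{p_{ia_i}}\, r_{ia_i}.
\end{equation}
A logged action with a small propensity $p_{ia_i}$ receives a large weight $\pi_{ia_i}/p_{ia_i}$, which is the source of the high variance that the NW and MNW approaches aim to reduce.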
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the three reviewers for their helpful and constructive comments, as well as the AE for their contribution. In response to these comments, we have revised the manuscript accordingly; the changes are highlighted in blue in the revised paper. Below, we summarize the main revisions. We would greatly appreciate any further comments on the revised manuscript.

* On the motivation. We have presented our representation results for off-policy evaluation in Section 3.1.1. Define
\begin{equation}
f^{\pi}(p_{ia})=\mathbb{E}[\pi_{ia}r_{ia} \mid p_{ia}].
\end{equation}
Using this definition and the law of total expectation, we obtain the representation results
\begin{equation}
V^{\pi} = \mathbb{E}\left[p_{ia_i}^{-1}f^{\pi}(p_{ia_i})\right]
\end{equation}
and
\begin{equation}
V^{\pi}= \mathbb{E}\left[f^{\pi}(p_{ia})\right].
\end{equation}
The first result shows that $f^{\pi}(\cdot)$ possesses a design-based representation analogous to the IPW estimator, and the second shows that it also admits a model-based representation analogous to the DM estimator (a sketch of the first identity is given after this list). If we instead define $f^{r}(p_{ia})=\mathbb{E}[r_{ia} \mid p_{ia}]$, the resulting quantity $\pi_{ia}f^{r}(\cdot)$ does not inherit these representation properties.

* On the model framework. We have provided an additional justification in Section 3.1.2 for why the dataset suffices to recover the representation of $f^{\pi}(\cdot)$. In the off-policy evaluation problem, the data are assumed to be collected under a mechanism in which the action assignment $a_i$ for unit $i$ depends only on $p_{ia}$. That is, conditional on $p_{ia}$, the action assignment $a_i$ is independent of $\pi_{ia}r_{ia}$. Therefore, from the representation result for $f^{\pi}(\cdot)$, we obtain the model framework, which links $\pi_{ia}r_{ia}$ to $p_{ia}$ based on the data set $\mathcal{S}$.

* On the illustrative example. We have rewritten the reward-generation process for clarity in Section 3.6. We also conducted a toy simulation in Section 4.2 to illustrate the potential advantage of the MNW estimator over the NW estimator.

* On the experiments in Section 5. We conducted an experiment in which the logging policy is estimated; the corresponding performance results are reported in Table 5 in the appendix. The results show that our approaches perform consistently across different settings. We further examine performance under varying sample sizes and report results for a representative dataset, Page, in Figure 1 in the appendix. These results demonstrate that our approaches are robust to the sample size used for policy evaluation. We also added more details to facilitate replication; for example, we provide a summary of the datasets used for policy evaluation.

* On robustness to the estimation of $p_{ia}$. In the revised paper, we provide additional discussion of robustness to behavior policy estimation in Section 3.4. We separately investigate the impact of estimation error and the impact of bias arising from model misspecification. In particular, for the latter, we explain why our approach is robust to behavior policy estimation.
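As a concrete check of the first representation result, here is a minimal sketch of the derivation. It assumes the standard IPW identity $V^{\pi}=\mathbb{E}\left[p_{ia_i}^{-1}\pi_{ia_i}r_{ia_i}\right]$ (which holds under the usual overlap condition on the logging policy) and that, by the conditional independence stated above, $f^{\pi}(p_{ia_i})=\mathbb{E}[\pi_{ia_i}r_{ia_i}\mid p_{ia_i}]$ also holds at the logged action $a_i$. Since $p_{ia_i}^{-1}$ is a function of $p_{ia_i}$, the law of total expectation gives
\begin{equation}
\mathbb{E}\left[p_{ia_i}^{-1}f^{\pi}(p_{ia_i})\right]
=\mathbb{E}\left[\mathbb{E}\left[p_{ia_i}^{-1}\pi_{ia_i}r_{ia_i}\mid p_{ia_i}\right]\right]
=\mathbb{E}\left[p_{ia_i}^{-1}\pi_{ia_i}r_{ia_i}\right]
=V^{\pi}.
\end{equation}
The second representation follows by applying the law of total expectation directly to the definition of $f^{\pi}(p_{ia})$.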
Assigned Action Editor: ~Inigo_Urteaga1
Submission Number: 6353