Clarifying Uncertainty Quantification in Off-Policy Evaluation: Beyond Effective Sample Sizes, Towards Confidence Intervals
Keywords: Off-Policy Evaluation, Uncertainty Quantification, Effective Sample Size, Confidence Intervals, Importance Sampling, Offline Reinforcement Learning
TL;DR: Weight-based ESS diagnoses importance-weight concentration, not OPE uncertainty in general; confidence-interval width and empirical coverage are better candidates for cross-estimator uncertainty comparison.
Abstract: Off-policy evaluation (OPE) is often used to assess whether a policy learned from previously collected behavior data is reliable enough to deploy, or to compare candidate policies, but good decisions require uncertainty diagnostics that reflect the actual error of the estimator being used. A common practice is to report the normalized-weight effective sample-size proxy
$\widehat{\mathrm{ESS}}=1 / \sum_{i} \bar w_i^2$, where $\bar w_i$ is the $i$-th importance weight normalized so that $\sum_i \bar w_i = 1$,
and to treat it as evidence about the reliability of an OPE estimate. We argue that this interpretation is too broad: $\widehat{\mathrm{ESS}}$ is best understood as a practical approximation motivated by self-normalized importance sampling (SNIS), not as a universal uncertainty measure. First, even within importance sampling, it is incomplete because fixed normalized weights can correspond to different estimator variances when the reward variance changes; second, across direct, hybrid, and fitted evaluators such as DM, MRDR, and FQE, it is not estimator-agnostic at all. This leaves cross-estimator uncertainty comparison as an open problem. We therefore argue that interval-based summaries provide a more promising common language for comparison: confidence-interval width can be reported in practice, while empirical coverage provides a simulation-based calibration metric for evaluating whether intervals are valid.
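As an illustration of the first point, the following minimal sketch holds a set of importance weights fixed, so that $\widehat{\mathrm{ESS}}$ is identical across runs, and varies only the reward standard deviation. It is not the paper's experimental setup: the lognormal weights, the Gaussian rewards drawn independently of the weights, the delta-method normal interval, and the function names `ess`, `snis_with_ci`, and `empirical_coverage` are all assumptions made for illustration.

```python
import numpy as np


def ess(weights):
    """Normalized-weight ESS proxy: 1 / sum_i wbar_i^2."""
    wbar = weights / weights.sum()
    return 1.0 / np.sum(wbar ** 2)


def snis_with_ci(weights, rewards, z=1.96):
    """SNIS point estimate with a delta-method normal-approximation interval."""
    wbar = weights / weights.sum()
    est = np.sum(wbar * rewards)
    # Plug-in standard error of the self-normalized estimator.
    se = np.sqrt(np.sum(wbar ** 2 * (rewards - est) ** 2))
    return est, est - z * se, est + z * se


def empirical_coverage(weights, reward_sd, true_value, n_trials, rng):
    """Fraction of simulated intervals that contain the known true value."""
    hits = 0
    for _ in range(n_trials):
        rewards = rng.normal(loc=true_value, scale=reward_sd, size=weights.size)
        _, lo, hi = snis_with_ci(weights, rewards)
        hits += int(lo <= true_value <= hi)
    return hits / n_trials


rng = np.random.default_rng(0)
n = 500
# Hypothetical importance weights, held fixed for every run below.
weights = rng.lognormal(mean=0.0, sigma=1.0, size=n)
ess_value = ess(weights)

# Same weights (same ESS), different reward noise: the interval width changes.
widths = {}
for reward_sd in (0.1, 1.0):
    rewards = rng.normal(loc=1.0, scale=reward_sd, size=n)
    _, lo, hi = snis_with_ci(weights, rewards)
    widths[reward_sd] = hi - lo

# Rewards are independent of the weights here, so the target value is 1.0.
coverage = empirical_coverage(weights, reward_sd=1.0, true_value=1.0,
                              n_trials=500, rng=rng)

print(f"ESS = {ess_value:.1f} (unchanged across reward settings)")
for reward_sd, width in widths.items():
    print(f"reward sd = {reward_sd}: interval width = {width:.3f}")
print(f"empirical coverage of nominal 95% interval = {coverage:.3f}")
```

In this toy setting the ESS proxy cannot distinguish the two reward settings, whereas the interval width does. The coverage figure is computable only because the true value is known by construction, which is why coverage serves as a simulation-based check rather than a quantity reported on real logged data.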
Submission Number: 38