Keywords: off-policy evaluation, simultaneous confidence region, convex Gaussian approximation, bootstrap, reinforcement learning
TL;DR: This work presents the first asymptotically correct simultaneous confidence region for off-policy evaluation in reinforcement learning.
Abstract: This work presents the first theoretically justified simultaneous inference framework for off-policy evaluation (OPE). In contrast to existing methods that focus on point estimates or pointwise confidence intervals (CIs), the new framework quantifies uncertainty globally across an infinite or continuous initial state space, yielding inference that is valid over the entire space. Our method leverages sieve-based Q-function estimation and (high-dimensional) Gaussian approximation techniques over convex regions, which in turn motivate a new multiplier bootstrap algorithm for constructing asymptotically correct simultaneous confidence regions (SCRs). The widths of the SCRs exceed those of the pointwise CIs by only a logarithmic factor, indicating that our procedure is nearly optimal in terms of efficiency. The effectiveness of the proposed approach is demonstrated through simulations and analysis of the OhioT1DM dataset.
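To make the construction concrete, the sketch below illustrates the generic multiplier-bootstrap recipe for a simultaneous confidence band over a linear sieve estimate. It is a minimal stand-in under assumed simplifications, not the paper's algorithm: the basis `phi`, the toy regression data, and the plug-in influence terms `psi` are all illustrative assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical setup: linear sieve regression y_i ~ phi(x_i)^T beta ---
# (A stand-in for sieve-based Q-function estimation; all names are illustrative.)
def phi(x, K=8):
    """Polynomial sieve basis of dimension K on [0, 1]."""
    return np.vander(x, N=K, increasing=True)  # shape (len(x), K)

n = 500
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)  # toy targets

Phi = phi(x)                                 # (n, K) design matrix
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
resid = y - Phi @ beta_hat

# Plug-in influence terms psi_i = Sigma^{-1} phi(x_i) * eps_i
Sigma_inv = np.linalg.inv(Phi.T @ Phi / n)
psi = (Phi * resid[:, None]) @ Sigma_inv     # (n, K)

# Grid over the (continuous) initial-state space and pointwise std errors
grid = np.linspace(0.0, 1.0, 200)
Phi_g = phi(grid)                            # (m, K)
cov = psi.T @ psi / n                        # (K, K) covariance of psi
se = np.sqrt(np.einsum("mk,kj,mj->m", Phi_g, cov, Phi_g) / n)

# --- Multiplier bootstrap for the studentized sup statistic ---
B, alpha = 2000, 0.05
xi = rng.standard_normal((B, n))             # Gaussian multipliers
boot = (xi @ psi) / np.sqrt(n)               # draws of sqrt(n)*(beta* - beta_hat)
sup_stat = np.max(np.abs(boot @ Phi_g.T) / (np.sqrt(n) * se), axis=1)
c = np.quantile(sup_stat, 1 - alpha)         # simultaneous critical value

est = Phi_g @ beta_hat
lower, upper = est - c * se, est + c * se    # simultaneous confidence band
print(f"simultaneous critical value: {c:.3f} (vs pointwise z = 1.96)")
```

Consistent with the abstract's efficiency claim, the simultaneous critical value `c` produced by this kind of procedure typically inflates the pointwise normal quantile by roughly a logarithmic factor in the grid resolution and basis dimension.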
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 21761