PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data

ICLR 2026 Conference Submission 13202 Authors

18 Sept 2025 (modified: 08 Oct 2025). ICLR 2026 Conference Submission. License: CC BY 4.0
Keywords: policy evaluation, confidence intervals, conformal prediction
TL;DR: We introduce two methods for constructing valid confidence intervals for OPE estimators that use both real and synthetic data.
Abstract: Off-policy evaluation (OPE) methods estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of OPE. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation in OPE for RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are essential for safe and informed deployment. In this work, we propose two methods to construct valid confidence intervals for OPE under data augmentation. The first provides a confidence interval over $V^{\pi}(s)$, the policy performance conditioned on an initial state $s$; to do so, we introduce a new conformal prediction method suited to Markov Decision Processes (MDPs) with high-dimensional state spaces. The second addresses the more common task of estimating the average policy performance over many initial states, $V^{\pi}$, with a method that draws on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning robotics, healthcare, and inventory management, as well as a real healthcare dataset from MIMIC-IV, we find that our methods effectively leverage auxiliary data and consistently produce confidence intervals that cover the ground-truth policy values, unlike previously proposed methods. Our work enables a future in which OPE can provide rigorous uncertainty estimates for high-stakes domains.
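To give a concrete flavor of the prediction-powered-inference (PPI) idea the abstract draws on, here is a minimal sketch of a PPI-style confidence interval for a mean policy value that combines a few real returns with many model-generated ones. This is not the paper's PERRY algorithm; the function name, variables, and simulated data below are illustrative assumptions only.

```python
# Illustrative sketch only: a PPI-style confidence interval for a mean
# return, combining scarce real data with abundant synthetic data.
# All names and data here are hypothetical, not the paper's method.
import numpy as np
from scipy import stats

def ppi_mean_ci(real_returns, model_on_real, model_on_synth, alpha=0.05):
    """PPI-style (1 - alpha) CI for the expected return.

    real_returns:   returns observed on the n real trajectories
    model_on_real:  model-predicted returns for those same n trajectories
    model_on_synth: model-predicted returns on N synthetic trajectories
    """
    n, N = len(real_returns), len(model_on_synth)
    # The "rectifier" corrects the synthetic mean for model bias.
    rectifier = real_returns - model_on_real
    estimate = model_on_synth.mean() + rectifier.mean()
    # Standard error combines the synthetic-mean and rectifier-mean terms.
    se = np.sqrt(model_on_synth.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se

# Toy usage with simulated data (purely illustrative):
rng = np.random.default_rng(0)
true_value = 1.0
real = rng.normal(true_value, 1.0, size=50)            # few real rollouts
model_real = real + rng.normal(0.3, 0.5, size=50)      # biased model predictions
model_synth = rng.normal(true_value + 0.3, 0.5, 5000)  # many synthetic rollouts
print(ppi_mean_ci(real, model_real, model_synth))
```

Even though the synthetic model is biased (+0.3 here), the rectifier term keeps the interval centered on the true value while the large synthetic sample shrinks its width, which is the intuition behind augmenting OPE with auxiliary data.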
Primary Area: reinforcement learning
Submission Number: 13202