Multiple-policy Evaluation via Density Estimation

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: We study the multiple-policy evaluation problem, where we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named CAESAR for this problem. Our approach is based on computing an approximately optimal sampling distribution and using the data sampled from it to simultaneously estimate the policy values. CAESAR has two phases. In the first phase, we produce coarse estimates of the visitation distributions of the target policies at a low-order sample complexity that scales as $\tilde{O}(\frac{1}{\epsilon})$. In the second phase, we approximate the optimal sampling distribution and compute the importance-weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE objective. Up to low-order and logarithmic terms, CAESAR achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$ is the visitation distribution of policy $\pi$, $\mu^*$ is the optimal sampling distribution, and $H$ is the horizon.
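To make the quantities in the bound concrete, the following is an illustrative sketch rather than the paper's exact construction: a per-step sampling distribution chosen to minimize the worst-case variance proxy appearing in the bound, and the standard marginalized importance-weighted value estimate built from it. The symbols $n$ (number of samples drawn from the estimated sampling distribution), $\hat{d}$, $\hat{\mu}$, and $r_h^i$ are our own notation and are not defined in the abstract.

$$\mu^*_h \in \arg\min_{\mu_h \in \Delta(\mathcal{S}\times\mathcal{A})} \; \max_{k\in[K]} \sum_{s,a} \frac{\big(d_h^{\pi^k}(s,a)\big)^2}{\mu_h(s,a)}, \qquad \hat{v}^{\pi^k} = \frac{1}{n}\sum_{i=1}^{n} \sum_{h=1}^{H} \frac{\hat{d}_h^{\pi^k}(s_h^i,a_h^i)}{\hat{\mu}_h(s_h^i,a_h^i)}\, r_h^i .$$

Under this reading, the $\max_{k\in[K]}\sum_{s,a}(d_h^{\pi^k}(s,a))^2/\mu^*_h(s,a)$ term in the sample complexity is exactly the objective value attained by the optimal sampling distribution at step $h$.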
Lay Summary: We study the problem of evaluating the performance of multiple decision-making strategies (policies), known as multiple-policy evaluation. Evaluating a policy typically requires collecting data that reflects its behavior. A naive approach to multiple-policy evaluation is to evaluate each policy independently, which requires a large amount of data, since data collected for one policy cannot be reused for others. We propose a new algorithm that evaluates multiple policies more efficiently by leveraging the potential similarity between them. The algorithm operates in two phases: first, it quickly computes rough estimates of how each policy behaves using a small number of samples; then, based on these rough estimates, it computes a near-optimal data-collection strategy whose data is used to evaluate the performance of all policies simultaneously. Our method reduces the data required compared to independent evaluation, especially when the number of policies is large. We provide rigorous theoretical results for the multiple-policy evaluation problem, which may also be of interest in broader contexts. In practice, our method can significantly lower costs in applications like robotics or healthcare, where trying out different strategies is expensive.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Theoretical Reinforcement Learning; Policy Evaluation; Instance-dependent results
Submission Number: 13391