Barycenter Policy Design for Multiple Policy Evaluation

Published: 17 Jul 2025, Last Modified: 07 Oct 2025
Venue: EWRL 2025 Poster
License: CC BY 4.0
Keywords: Bandits, Importance Sampling, Behavior Policy
Abstract: A growing challenge in reinforcement learning is to efficiently explore the action space to evaluate multiple target policies using importance sampling. When target policies share similarities, leveraging these resemblances in the behavior policy is crucial for sample efficiency. However, formally defining and algorithmically utilizing such similarities remains an open problem. This article introduces a behavior policy design, examining how different criteria for selecting a behavior policy influence importance sampling estimator properties. We evaluate the resulting behavior policies in downstream tasks, particularly in best policy selection problems. Additionally, we demonstrate how effectively leveraging similarities among target policies results in a more nuanced behavior policy design and enhances regret bounds for best policy selection. To facilitate rigorous analysis, the article is formulated within the stochastic bandit framework.
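As a rough illustration of the setting the abstract describes, the following minimal sketch evaluates several target policies in a stochastic bandit from data collected under a single behavior policy, using the standard importance sampling estimator. Everything concrete here is an assumption made for illustration: the arm means, the two target policies, the sample size, and the helper name `is_estimates` are hypothetical, and the behavior policy is simply the uniform mixture of the targets, which is one plausible averaging choice and not necessarily the design proposed in the paper.

```python
import numpy as np


def is_estimates(target_policies, behavior, actions, rewards):
    """Importance sampling value estimates, one per target policy.

    target_policies: array (M, K), each row a distribution over K arms.
    behavior: array (K,), the distribution that generated `actions`.
    actions: int array (n,), arms pulled under the behavior policy.
    rewards: float array (n,), rewards observed for those pulls.
    """
    # Weight pi_m(a_t) / b(a_t) for every target m and every sample t.
    weights = target_policies[:, actions] / behavior[actions]  # shape (M, n)
    return (weights * rewards).mean(axis=1)


rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit with three arms.
means = np.array([0.2, 0.5, 0.8])

# Two hypothetical target policies over the three arms.
targets = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
])

# Assumed behavior policy: uniform mixture of the targets (illustrative only).
behavior = targets.mean(axis=0)

n = 20000
actions = rng.choice(len(means), size=n, p=behavior)
rewards = rng.binomial(1, means[actions]).astype(float)

estimates = is_estimates(targets, behavior, actions, rewards)
true_values = targets @ means        # exact policy values for comparison
best = int(np.argmax(estimates))     # naive best policy selection
```

Because the mixture puts positive mass on every arm that any target plays, all importance weights are bounded (here by 1.75), which keeps the estimator's variance under control. Selecting `best` by the largest estimate is the simplest form of the best policy selection task mentioned in the abstract.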
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Till_Freihaut1
Track: Regular Track: unpublished work
Submission Number: 101