Barycenter Policy Design for Multiple Policy Evaluation

Simon Weissmann; Till Freihaut; Claire Vernade; Giorgia Ramponi; Leif Döring

Published: 17 Jul 2025, Last Modified: 07 Oct 2025, Venue: EWRL 2025 Poster, License: CC BY 4.0

Keywords: Bandits, Importance Sampling, Behavior Policy

Abstract: A growing challenge in reinforcement learning is to efficiently explore the action space to evaluate multiple target policies using importance sampling. When target policies share similarities, leveraging these resemblances in the behavior policy is crucial for sample efficiency. However, formally defining and algorithmically utilizing such similarities remains an open problem. This article introduces a behavior policy design and examines how different criteria for selecting a behavior policy influence the properties of the importance sampling estimator. We evaluate the resulting behavior policies in downstream tasks, particularly in best policy selection problems. Additionally, we demonstrate that effectively leveraging similarities among target policies results in a more nuanced behavior policy design and improves regret bounds for best policy selection. To facilitate rigorous analysis, the article is formulated within the stochastic bandit framework.
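For readers unfamiliar with the setting, the following is a minimal sketch of importance sampling for evaluating several target policies from a single behavior policy in a stochastic bandit. It is not the barycenter design proposed in the paper: the behavior policy here is simply the uniform mixture of the target policies, chosen only for illustration, and the arm means, the policies, the sample size, and the function name `is_estimates` are all hypothetical.

```python
import numpy as np


def is_estimates(target_policies, behavior, actions, rewards):
    """Importance sampling value estimates, one per target policy.

    target_policies: array of shape (K, A), each row a distribution over arms.
    behavior: array of shape (A,), the distribution used to collect the data.
    actions, rewards: arrays of shape (n,), the logged arms and rewards.
    """
    # weights[k, t] = pi_k(a_t) / b(a_t)
    weights = target_policies[:, actions] / behavior[actions]
    return (weights * rewards).mean(axis=1)


rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit with Gaussian rewards.
mean_rewards = np.array([0.2, 0.5, 0.8])

# Two hypothetical target policies over the three arms.
targets = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.2, 0.7]])

# Assumption for illustration only: behavior policy = uniform mixture of targets.
behavior = targets.mean(axis=0)

n = 20000
actions = rng.choice(3, size=n, p=behavior)
rewards = rng.normal(mean_rewards[actions], 0.1)

estimates = is_estimates(targets, behavior, actions, rewards)
true_values = targets @ mean_rewards  # exact policy values for comparison
print(estimates, true_values)
```

Because the mixture puts positive mass on every arm that any target policy plays, the importance weights are bounded and each estimate is unbiased for the corresponding policy value. How to choose the behavior policy beyond such a simple mixture is the question the abstract raises.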

Serve As Reviewer: ~Till_Freihaut1

Track: Regular Track: unpublished work

Submission Number: 101
