Adaptive Exploration for Multi-Reward Multi-Policy Evaluation

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, Readers: Everyone, License: CC BY 4.0
Abstract: We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme.
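As a rough illustration of the quantity being estimated, the sketch below computes the exact discounted value of every policy under every reward function in a small tabular MDP with known transitions. It is not the MR-NaS exploration scheme or the lower bound from the paper. The function name `evaluate_policies`, the array shapes, and the toy MDP are assumptions made for this example only. In the online setting the transitions are unknown, and an $(\epsilon,\delta)$-PAC method would have to estimate these values from samples to within $\epsilon$ with high confidence.

```python
import numpy as np

def evaluate_policies(P, rewards, policies, gamma):
    """Exact discounted values V[k, r, s] for policy k, reward r, state s.

    Assumed shapes (hypothetical, for this sketch):
      P:        (S, A, S) transition probabilities
      rewards:  (R, S, A) one reward table per reward function
      policies: (K, S, A) action probabilities, each row sums to 1
    """
    S = P.shape[0]
    values = np.empty((policies.shape[0], rewards.shape[0], S))
    for k, pi in enumerate(policies):
        # State-to-state transition matrix induced by policy pi.
        P_pi = np.einsum("sa,sat->st", pi, P)
        # Expected one-step reward under pi, one row per reward function.
        r_pi = np.einsum("sa,rsa->rs", pi, rewards)
        # Solve (I - gamma * P_pi) V = r_pi for all rewards at once.
        values[k] = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi.T).T
    return values

# Toy MDP with 2 states and 2 actions (made-up numbers).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
rewards = np.array([
    np.ones((2, 2)),             # reward 0: constant reward of 1
    [[0.0, 0.0], [1.0, 1.0]],    # reward 1: reward of 1 only in state 1
])
policies = np.array([
    [[1.0, 0.0], [1.0, 0.0]],    # policy 0: always take action 0
    [[0.5, 0.5], [0.5, 0.5]],    # policy 1: uniformly random
])

V = evaluate_policies(P, rewards, policies, gamma)
print(V.shape)  # one value vector per (policy, reward) pair
```

Solving the linear system once per policy and sharing it across all reward functions reflects the multi-reward structure: the transition dynamics under a policy do not depend on which reward is being evaluated.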
Lay Summary: Many real-world decision-making systems—such as recommendation algorithms, robotics, or personalized AI assistants—need to evaluate how well multiple strategies perform across diverse goals simultaneously. Traditionally, this process can become extremely resource-intensive, requiring significant time and data to obtain accurate results. We developed an approach that simultaneously evaluates multiple strategies across multiple objectives in a reliable and efficient way. Our method strategically chooses how to gather information at each step, ensuring accurate results with minimal data. This enables quicker, more reliable insights, improving how we develop systems that need to balance multiple goals—such as improving user satisfaction while minimizing costs or environmental impacts.
Link To Code: https://github.com/rssalessio/multi-reward-multi-policy-eval
Primary Area: Reinforcement Learning
Keywords: policy evaluation, multi-reward, multi-policy, adaptive exploration, pure exploration, reinforcement learning
Submission Number: 8246