Efficient and Robust Behavior Policy Search for Online Off-policy Evaluation through Transition Gradients

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, Policy Evaluation, Robust Reinforcement Learning, Behavior Policy Search, Variance Reduction, Transition Uncertainty, Adversarial Modeling, Min-max Optimization, Off-Policy Evaluation
TL;DR: We propose a robust behavior policy search framework that reduces variance in policy evaluation by explicitly accounting for transition uncertainty.
Abstract: In reinforcement learning policy evaluation, classic on-policy methods often suffer from high variance when estimating policy performance. To mitigate this issue, *behavior policy search* has been proposed to learn data-collecting policies tailored to reduce the variance of online evaluation. However, these approaches do not account for uncertainty in the transition function. In practice, simulator transitions often differ from those of the real world due to modeling errors or approximation limitations. As a result, behavior policies trained in simulation may still yield high variance when deployed in real environments, leading to costly reliance on real-world evaluation samples. In this work, we propose a double-loop gradient-based algorithm for learning behavior policies that are both efficient and robust to transition uncertainty. Theoretically, we derive novel transition-variance gradient expressions and establish global convergence guarantees for the algorithm. Numerically, we demonstrate that our method is less sensitive to transition perturbations than existing approaches, providing supporting evidence for its practical utility.
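The double-loop min-max structure described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical illustration rather than the paper's algorithm: the objective `variance_objective`, the projection `project_to_uncertainty_set`, the step sizes, and the loop counts are all assumptions standing in for the paper's transition-variance gradients and uncertainty set. The inner loop performs gradient ascent on an adversarial transition model within a ball around the nominal simulator parameters; the outer loop performs gradient descent on the behavior policy against that worst case.

```python
# Minimal sketch of a double-loop min-max update, assuming a smooth toy
# surrogate for the estimator variance. NOT the authors' method: all names,
# step sizes, and the uncertainty-set radius below are illustrative.
import torch

def variance_objective(behavior_params, transition_params):
    # Placeholder for a variance surrogate of the off-policy estimator;
    # a smooth toy function so the sketch runs end to end.
    return ((behavior_params - transition_params) ** 2).sum()

def project_to_uncertainty_set(transition_params, center, radius=0.5):
    # Keep the adversarial transition model within a ball around the
    # nominal (simulator) transition parameters.
    delta = transition_params - center
    norm = delta.norm()
    if norm > radius:
        delta = delta * (radius / norm)
    return center + delta

behavior = torch.zeros(4, requires_grad=True)     # behavior policy parameters
nominal = torch.ones(4)                           # nominal transition parameters
adversary = nominal.clone().requires_grad_(True)  # adversarial transition parameters

for outer_step in range(100):
    # Inner loop: ascend on the transition model to approximate the
    # worst-case (variance-maximizing) transition in the uncertainty set.
    for inner_step in range(10):
        loss = variance_objective(behavior, adversary)
        grad_adv, = torch.autograd.grad(loss, adversary)
        with torch.no_grad():
            adversary += 0.1 * grad_adv  # gradient ascent
            adversary.copy_(project_to_uncertainty_set(adversary, nominal))
    # Outer loop: descend on the behavior policy against the worst case.
    loss = variance_objective(behavior, adversary)
    grad_beh, = torch.autograd.grad(loss, behavior)
    with torch.no_grad():
        behavior -= 0.05 * grad_beh  # gradient descent
```

The nested ascent-then-descent pattern is the generic shape of such robust (min-max) policy search; the paper's contribution lies in the specific transition-variance gradients and convergence guarantees, which this sketch does not reproduce.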
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 6400