Offline Policy Selection under Uncertainty

28 Sept 2020 (modified: 22 Oct 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Off-policy selection, reinforcement learning, Bayesian inference
Abstract: The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose BayesDICE for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints (as opposed to an explicit likelihood, which is not available). Empirically, BayesDICE is highly competitive with existing state-of-the-art approaches in confidence interval estimation. More importantly, we show how the belief distribution estimated by BayesDICE may be used to rank policies with respect to an arbitrary downstream policy selection metric, and we empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower-bound value estimates.
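
To make the abstract's central point concrete, here is a minimal sketch (my own illustration, not the authors' code) of why a full belief distribution is more flexible than a point estimate for selection: the same posterior samples of each candidate policy's value can be ranked under several downstream metrics. The synthetic posterior below is a stand-in for the one BayesDICE would produce.

```python
# Minimal sketch: selecting among policies given posterior samples of
# their values. The samples here are synthetic, not from BayesDICE.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples of each candidate policy's value,
# shape (num_policies, num_posterior_samples).
num_policies, num_samples = 5, 10_000
true_means = rng.uniform(0.0, 1.0, size=num_policies)
posterior = rng.normal(true_means[:, None], 0.1, size=(num_policies, num_samples))

# Point-estimate ranking: order policies by posterior mean value.
rank_by_mean = np.argsort(-posterior.mean(axis=1))

# Pessimistic ranking: order by a high-confidence (5th-percentile) lower bound.
rank_by_lcb = np.argsort(-np.percentile(posterior, 5, axis=1))

# Distribution-aware ranking: order by the posterior probability that each
# policy is the best, estimated by counting argmax wins across samples.
wins = np.bincount(posterior.argmax(axis=0), minlength=num_policies)
rank_by_prob_best = np.argsort(-wins / num_samples)

print("by mean:       ", rank_by_mean)
print("by lower bound:", rank_by_lcb)
print("by P(best):    ", rank_by_prob_best)
```

The three rankings can disagree on the same posterior, which is exactly the flexibility the abstract argues for: a point estimate fixes one ordering, while the full belief distribution supports whichever downstream selection metric the application demands.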
One-sentence Summary: Formally defines offline policy selection in RL and proposes Bayesian posterior inference over dual policy values based on stochastic constraints, enabling a diverse set of policy selection algorithms under a wide range of evaluation metrics.
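
The phrase "stochastic constraints (as opposed to an explicit likelihood)" can be illustrated with a toy, likelihood-free reweighting scheme. This is an approximate-Bayesian-computation-style stand-in of my own, not the BayesDICE posterior itself, which is over distribution correction ratios; the point is only that a posterior can be induced by softening an empirical constraint rather than by evaluating a likelihood.

```python
# Toy illustration: posterior inference from a stochastic constraint when
# no explicit likelihood is available. Sample parameters from a prior and
# reweight them by how well they satisfy an empirical moment constraint.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.7, 0.2, size=200)  # hypothetical observed quantities

# Constraint: a valid parameter mu should make E[x - mu] = 0 on the data.
def constraint_residual(mu, x):
    return np.mean(x - mu)

# Prior samples over the unknown parameter.
prior_samples = rng.uniform(0.0, 1.5, size=5_000)

# Soften the hard constraint into weights: small residual -> high weight.
lam = 500.0  # temperature controlling how strictly the constraint binds
residuals = np.array([constraint_residual(mu, data) for mu in prior_samples])
weights = np.exp(-lam * residuals**2)
weights /= weights.sum()

# The reweighted prior approximates a posterior induced by the constraint.
posterior_mean = np.sum(weights * prior_samples)
print(f"posterior mean ≈ {posterior_mean:.3f} (data generated around 0.7)")
```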
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2012.06919/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=yN42qwq0w7