DualDICE: Efficient Estimation of Off-Policy Stationary Distribution Corrections

Ofir Nachum; Yinlam Chow; Bo Dai; Lihong Li

Published: 28 May 2019, Last Modified: 05 May 2023; Venue: RL4RealLife 2019; Readers: Everyone

Keywords: rl, off-policy, estimation

Abstract: In many real-world reinforcement learning domains, access to the environment is limited to a fixed dataset rather than direct (online) interaction. When this data is used to evaluate or train a new target policy, accurate estimates of stationary distribution ratios – correction terms that quantify the likelihood that the target policy will experience a certain state-action pair, normalized by the probability with which that pair appears in the dataset – can improve accuracy and performance. In this work, we derive and study an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm requires no knowledge of the behavior policy (or policies) used to collect the dataset. Furthermore, our algorithm eschews any use of importance weights, thus avoiding the optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation, and we find that it yields significant accuracy improvements over competing techniques.
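To make the quantity in the abstract concrete, the following minimal sketch computes a stationary distribution ratio exactly on a toy two-state, two-action problem and checks that reweighting the dataset distribution by that ratio recovers the target policy's average reward. It is not the DualDICE estimator: the ratio is computed here from known dynamics and a known behavior policy, which is precisely the information DualDICE is designed to do without. The transition table, rewards, policies, discount factor, and the use of a discounted state-action occupancy as the "stationary distribution" are all assumptions made for illustration.

```python
# Toy illustration of a stationary distribution correction (NOT DualDICE itself).
# Every number below is made up for illustration.

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[s][a] lists the probability of each next state.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.7, 0.3], 1: [0.1, 0.9]},
}
# R[s][a] is the immediate reward.
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
MU0 = [1.0, 0.0]  # initial state distribution

# policy[s] lists the probability of each action in state s.
TARGET = {0: [0.2, 0.8], 1: [0.3, 0.7]}
BEHAVIOR = {0: [0.6, 0.4], 1: [0.5, 0.5]}  # assumed source of the dataset


def discounted_occupancy(policy, horizon=2000):
    """Return d(s, a) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s, a_t = a)."""
    occupancy = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state_dist = list(MU0)
    weight = 1.0 - GAMMA
    for _ in range(horizon):
        next_dist = [0.0 for _ in STATES]
        for s in STATES:
            for a in ACTIONS:
                p_sa = state_dist[s] * policy[s][a]
                occupancy[(s, a)] += weight * p_sa
                for s_next in STATES:
                    next_dist[s_next] += p_sa * P[s][a][s_next]
        state_dist = next_dist
        weight *= GAMMA
    return occupancy


d_target = discounted_occupancy(TARGET)   # what the target policy would visit
d_data = discounted_occupancy(BEHAVIOR)   # what the dataset contains

# The correction term: target likelihood normalized by dataset likelihood.
ratio = {sa: d_target[sa] / d_data[sa] for sa in d_target}

# Average reward of the target policy, computed two ways.
value_direct = sum(d_target[(s, a)] * R[s][a] for (s, a) in d_target)
value_corrected = sum(d_data[(s, a)] * ratio[(s, a)] * R[s][a] for (s, a) in d_data)

print("ratios:", ratio)
print("direct:", value_direct, "corrected:", value_corrected)
```

The two printed values agree because weighting each dataset state-action pair by its ratio turns an expectation under the dataset distribution into one under the target policy's distribution. In practice neither distribution is known, so the ratio has to be estimated from samples.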
