- Keywords: rl, off-policy, estimation
- Abstract: In many real-world reinforcement learning domains, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new target policy, accurate estimates of stationary distribution ratios – correction terms that quantify the likelihood that the target policy will experience a certain state-action pair, normalized by the probability with which the state-action pair appears in the dataset – can improve accuracy and performance. In this work, we derive and study an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to collect the dataset. Furthermore, our algorithm eschews any use of importance weights, thus avoiding potential optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation, and we find that our algorithm yields significant accuracy improvements compared to competing techniques.
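To make the role of the correction terms concrete, here is a minimal sketch (not the DualDICE estimator itself, and using synthetic data): once ratios w(s, a) = d_pi(s, a) / d_D(s, a) are available, off-policy evaluation of an average reward reduces to reweighting the logged rewards by those ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged dataset: rewards observed at sampled (s, a) pairs.
rewards = rng.normal(loc=1.0, scale=0.5, size=1000)

# Hypothetical correction ratios w(s, a) = d_pi(s, a) / d_D(s, a); in
# practice these are the quantities an estimator such as DualDICE produces.
ratios = rng.uniform(0.5, 1.5, size=1000)

# Reweighted estimate: the empirical average of w(s, a) * r under the
# dataset distribution approximates the expected reward under the target
# policy's stationary distribution.
value_estimate = np.mean(ratios * rewards)
print(value_estimate)
```

Note that the reweighting uses per-state-action ratios rather than per-trajectory products of importance weights, which is the instability the abstract refers to.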