Keywords: reinforcement learning, off-policy evaluation
Abstract: In the off-policy policy evaluation (OPE) problem, we want to estimate an agent's performance without online interaction with the environment. This is difficult because of the distribution mismatch between the actions the learned policy would take and the offline validation set of interactions. OPE is commonly done through importance weighting or model learning. In this paper, we propose an alternative OPE metric, focusing on the special (but relatively common) case of deterministic MDPs with sparse binary rewards, that treats a learned critic Q(s,a) as a classifier of trajectory success and uses its classification accuracy to estimate the return of the corresponding policy. Because it requires only a Q-function estimate, the proposed metric can be applied in learning regimes where importance sampling or model fitting is difficult or infeasible. Experiments in toy and Atari environments show that the metric correlates with return better than ad hoc approaches such as the TD error. Turning to cross-domain generalization, we test the OPE metric on a difficult high-dimensional, image-based, real-world robot grasping setup. When applied to models trained only in simulation, the metric continues to correlate well with return, even when the test environment is the real world and uses objects not seen at training time. This opens the potential to greatly reduce real-robot usage when developing new models.
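To make the classifier view of the critic concrete, below is a minimal sketch of the kind of score the abstract describes, not the paper's exact metric: it assumes a hypothetical callable `q_fn` producing Q-value estimates, a held-out set of offline transitions labeled with binary trajectory success (consistent with the sparse binary-reward setting), and a simple decision threshold; the function name, threshold, and labeling scheme are illustrative assumptions.

```python
import numpy as np

def ope_classification_score(q_fn, states, actions, success, threshold=0.5):
    """Hypothetical OPE score: treat Q(s, a) as a binary classifier of
    whether (s, a) lies on a successful trajectory, and report its
    classification accuracy on held-out offline data.

    Args:
        q_fn: callable mapping (states, actions) -> Q-value estimates (assumed in [0, 1]).
        states, actions: arrays of held-out offline transitions.
        success: binary array, 1 if the transition's trajectory succeeded.
        threshold: assumed cutoff for turning Q-values into 0/1 predictions.
    """
    q_values = np.asarray(q_fn(states, actions))
    predictions = (q_values >= threshold).astype(int)
    return float(np.mean(predictions == np.asarray(success)))
```

In this sketch, candidate policies or training checkpoints would be ranked by their score on the same validation set, so that model selection can proceed without running each candidate on the real system.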