Abstract: In off-policy reinforcement learning, a behaviour policy performs exploratory interactions with the environment to obtain state-action-reward samples, which are then used to learn a target policy that optimises the expected return. This leads to the problem of off-policy evaluation, where one needs to evaluate the target policy from samples collected by the often unrelated behaviour policy. Importance sampling is a traditional statistical technique that is often applied to off-policy evaluation. While importance sampling estimators are unbiased, their variance increases exponentially with the horizon of the decision process because the importance weight is computed as a product of action probability ratios, yielding estimates with low accuracy for domains involving long-term planning. This paper proposes state-based importance sampling (SIS), which drops the action probability ratios of sub-trajectories with "negligible states" -- roughly speaking, those for which the chosen actions have no impact on the return estimate -- from the computation of the importance weight. Theoretical results demonstrate a smaller exponent for the variance upper bound as well as a lower mean squared error. To identify negligible states, two search algorithms are proposed, one based on covariance testing and one based on state-action values. Building on the formulation of SIS, we then derive analogous state-based variants of weighted importance sampling, per-decision importance sampling, and incremental importance sampling, using the state-action value identification algorithm. Moreover, we note that doubly robust estimators may also benefit from SIS. Experiments in two gridworld domains and one inventory management domain show that state-based methods yield reduced variance and improved accuracy.
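To make the weighting idea concrete, here is a minimal sketch (not the authors' code) contrasting the ordinary importance weight, a product of action probability ratios over the whole trajectory, with a state-based weight that skips the ratio at states flagged as negligible. The `is_negligible` predicate is a placeholder; the paper identifies such states via covariance testing or state-action values.

```python
def ordinary_is_weight(trajectory, pi_target, pi_behaviour):
    """Ordinary importance weight: product of action probability ratios
    pi_target(a|s) / pi_behaviour(a|s) over every step of the trajectory."""
    w = 1.0
    for s, a in trajectory:
        w *= pi_target(a, s) / pi_behaviour(a, s)
    return w

def state_based_is_weight(trajectory, pi_target, pi_behaviour, is_negligible):
    """State-based variant (sketch): drop the ratio at states where the
    chosen action is assumed not to affect the return estimate."""
    w = 1.0
    for s, a in trajectory:
        if not is_negligible(s):
            w *= pi_target(a, s) / pi_behaviour(a, s)
    return w

# Toy usage: a 3-step trajectory of (state, action) pairs with two actions.
traj = [(0, 1), (1, 0), (2, 1)]
pi_t = lambda a, s: 0.8 if a == 1 else 0.2   # hypothetical target policy
pi_b = lambda a, s: 0.5                       # hypothetical uniform behaviour policy
print(ordinary_is_weight(traj, pi_t, pi_b))
print(state_based_is_weight(traj, pi_t, pi_b, is_negligible=lambda s: s == 1))
```

Because the state-based weight multiplies fewer ratios, its product has fewer terms than the horizon length, which is the mechanism behind the smaller exponent in the variance upper bound claimed above.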