Keywords: Off-policy evaluation, Domain Adaptation, Spectral Methods
Abstract: Contextual bandits capture the partial-feedback nature of interactive systems.
Contextual bandit algorithms have wide applications in automated decision making, such as recommender systems and automated stock trading.
Evaluating the cumulative reward of a target policy from the historical trajectories of a logging policy (i.e., off-policy evaluation) in the contextual bandit setting is an important task: it provides an estimate of a new policy's performance without deploying it.
One common and well-studied solution is the Inverse Propensity Score (IPS) estimator, which estimates the expectation via importance sampling, i.e., by re-weighting the logged data with the ratio of the target and logging policies' action probabilities.
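For concreteness, given logged tuples $(x_i, a_i, r_i)$ collected under a logging policy $\pi_0$, the standard IPS estimate of the value of a target policy $\pi$ is

$$\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,$$

which is unbiased under stationarity whenever $\pi_0$ assigns positive probability to every action that $\pi$ may take.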
Existing work assumes that the distribution over the context space is stationary, which often fails to hold in real-world scenarios.
A more practical model allows the context and reward distributions to shift between the logged data and the contexts that will be observed when the target policy is evaluated in the future.
Such a problem is difficult in general due to the high dimensionality of the context space, as observed in our experiments.
In this paper, we propose an intent shift model, which introduces a latent intent variable to capture the distributional shift in contexts and rewards. Under the intent shift model, we derive a consistent spectral estimator for the reweighting factor, give a finite-sample analysis of it, and provide an MSE bound on the performance of the resulting off-policy estimator. Experiments show that our estimator outperforms existing ones.
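To make the reweighting idea concrete, below is a minimal illustrative sketch in Python. The discrete intent variable, the toy reward model, and the oracle shift ratio w(z) = p_new(z) / p_log(z) are all assumptions made for illustration only; the paper's spectral method, which estimates this ratio from data, is not reproduced here.

```python
import numpy as np

# Illustrative sketch of intent-shift reweighted IPS (hypothetical setup).
# Assumption: a discrete intent z in {0, ..., K-1} governs contexts/rewards,
# and only the marginal p(z) shifts between logging and evaluation time.

rng = np.random.default_rng(0)
n, K, n_actions = 5000, 3, 4

# Logged data: intents, a uniform logging policy, actions, toy rewards.
z = rng.integers(0, K, size=n)
pi0 = np.full((n, n_actions), 1.0 / n_actions)         # logging propensities
a = rng.integers(0, n_actions, size=n)
r = rng.binomial(1, 0.2 + 0.2 * (a == z % n_actions))  # toy reward model

# Target policy: mostly plays action (z mod n_actions).
pi = np.full((n, n_actions), 0.05)
pi[np.arange(n), z % n_actions] = 1.0 - 0.05 * (n_actions - 1)

# Oracle intent-shift ratio w(z) = p_new(z) / p_log(z); in the paper this
# is the quantity the spectral estimator recovers from data.
p_log = np.bincount(z, minlength=K) / n
p_new = np.array([0.6, 0.3, 0.1])
w = p_new / p_log

ips = pi[np.arange(n), a] / pi0[np.arange(n), a]
v_plain = np.mean(ips * r)            # ignores the shift: biased
v_shifted = np.mean(w[z] * ips * r)   # reweighted for the intent shift
print(f"plain IPS: {v_plain:.3f}, intent-reweighted IPS: {v_shifted:.3f}")
```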