Batch Learning via Log-Sum-Exponential Estimator from Logged Bandit Feedback

Published: 19 Jun 2024, Last Modified: 26 Jul 2024 · ARLET 2024 Poster · CC BY 4.0
Keywords: Reinforcement learning, off-policy learning, non-linear estimator, generalization error
TL;DR: We introduce and examine a novel estimator designed for batch learning from logged bandit feedback datasets.
Abstract: Offline policy learning methods in batch learning aim to derive a policy from a logged bandit feedback dataset, which records context, action, propensity score, and feedback for each sample point. To achieve this objective, inverse propensity score estimators are employed to minimize the cost. However, this approach is susceptible to high variance and poor performance under low-quality propensity scores. In response to these limitations, we propose a novel estimator inspired by the log-sum-exponential operator, which mitigates variance. Furthermore, we offer theoretical analysis, encompassing upper bounds on the bias and variance of our estimator, and an upper bound on the generalization error of the log-sum-exponential estimator (the difference between the empirical risk of the log-sum-exponential estimator and the true risk), with a convergence rate of $O(1/\sqrt{n})$, where $n$ is the number of training samples. Additionally, we examine the performance of our estimator under limited access to clean propensity scores and under an imbalanced logged bandit feedback dataset, where the number of samples per action differs.
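As a rough illustration of the idea (a sketch under assumed notation, not the paper's exact definition), the standard inverse propensity score (IPS) risk estimate and a log-sum-exponential (LSE) variant with a tuning parameter $\lambda$ could be written as follows, writing $\pi_0$ for the logging policy and $c_i$ for the observed cost of sample $i$:
$$
\hat{R}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, c_i,
\qquad
\hat{R}_{\mathrm{LSE}}^{\lambda}(\pi) = \frac{1}{\lambda}\log\left(\frac{1}{n}\sum_{i=1}^{n} \exp\left(\lambda\,\frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, c_i\right)\right).
$$
For $\lambda < 0$ the log-sum-exponential acts as a smooth minimum and softly truncates large importance-weighted costs, which is the intuition behind the variance reduction described in the abstract; as $\lambda \to 0$ the LSE estimate recovers the ordinary IPS average.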
Submission Number: 21