TL;DR: Inspired by the log-sum-exponential operator, we propose a novel estimator for off-policy learning and evaluation under a heavy-tailed assumption on the weighted reward.
Abstract: Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret, that is, the performance gap between our LSE estimator and the optimal policy, assuming a bounded $(1+\epsilon)$-th moment of the weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $\epsilon\in[0,1]$ and $n$ is the size of the logged bandit feedback dataset. The theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach. The code for our estimator is available at the following link: https://github.com/armin-behnamnia/lse-offpolicy-learning.
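To make the idea concrete, below is a minimal Python sketch of an LSE-style off-policy value estimate, assuming the operator is applied with a parameter $\lambda$ to the importance-weighted rewards $w_i r_i$ with $w_i = \pi(a_i\mid x_i)/\pi_0(a_i\mid x_i)$. The exact form, sign convention of $\lambda$, and its tuning are given in the paper and the linked repository; this is only an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def lse_estimate(rewards, target_probs, logging_probs, lam=-1.0):
    """Illustrative LSE-style off-policy value estimate (sketch).

    rewards:       observed feedback r_i from the logged data
    target_probs:  pi(a_i | x_i) under the target policy
    logging_probs: pi_0(a_i | x_i), the logged propensity scores
    lam:           LSE parameter; a negative value damps large
                   importance-weighted rewards (the default here is
                   an assumption, not the paper's recommended value)
    """
    z = rewards * target_probs / logging_probs   # importance-weighted rewards
    n = len(z)
    # (1 / lam) * log( (1/n) * sum_i exp(lam * z_i) ), computed stably
    return (logsumexp(lam * z) - np.log(n)) / lam

# For comparison, the standard IPS estimator is the plain mean of z_i:
#   ips = np.mean(rewards * target_probs / logging_probs)
# Unlike the mean, the LSE with lam < 0 is less sensitive to rare,
# extremely large weighted rewards, which is the source of the
# variance reduction and heavy-tail robustness discussed above.
```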
Lay Summary: In many real-world applications, we often want to learn or evaluate decision-making systems (like recommending products or showing ads) using data that was collected in the past, rather than running new experiments. This setup is called off-policy learning and evaluation. The data usually includes the situation (context), the action taken, how likely that action was to be taken (called the propensity score), and the result (feedback or reward).
However, this approach can run into problems, especially when the recorded action probabilities are inaccurate or when the feedback is noisy and unpredictable. These issues can make learning unstable and unreliable.
In this work, we propose a new method that uses a mathematical tool called the log-sum-exponential (LSE) operator. Compared to standard techniques, our method is more stable and less sensitive to noisy or extreme feedback. We provide mathematical guarantees showing how close our method’s results are to the best possible outcome, and we explain how this closeness improves as we get more data.
We also tested our method on a variety of tasks. The results show that it performs well in practice, especially in difficult situations where existing methods struggle.
Link To Code: https://github.com/armin-behnamnia/lse-offpolicy-learning
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: off-policy learning, off-policy evaluation, log sum exponential, regret bound, estimation bound, concentration, bias and variance, robustness, heavy-tailed reward
Submission Number: 7084