Catoni Contextual Bandits are Robust to Heavy-tailed Rewards

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
Abstract: Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
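For reference, a minimal sketch of the standard (unweighted) Catoni mean estimator that the paper builds on, stated under textbook assumptions; the paper's variance-weighted regression variant and its tuning may differ. Given i.i.d. samples $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, Catoni's estimator $\hat{\mu}$ is the root of
$$\sum_{i=1}^{n} \psi\big(\alpha (X_i - \hat{\mu})\big) = 0, \qquad \psi(x) = \mathrm{sign}(x)\,\log\!\left(1 + |x| + \tfrac{x^2}{2}\right),$$
with $\alpha \asymp \sqrt{\tfrac{2\log(1/\delta)}{n\sigma^2}}$. With this choice, $|\hat{\mu} - \mu| \lesssim \sigma \sqrt{\tfrac{\log(1/\delta)}{n}}$ with probability at least $1 - 2\delta$, a deviation bound that scales with the standard deviation $\sigma$ rather than the worst-case range $R$.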
Lay Summary: Many real-world decision-making systems—like those used in online advertising or wireless networks—face unpredictable rewards that can be very large or "heavy-tailed." This makes it difficult for standard learning algorithms to make good decisions, since they are designed assuming rewards are relatively well-behaved and bounded. We tackle this problem by designing new algorithms for contextual bandits, a type of learning model that helps an agent choose actions based on observed data. Our algorithms use a statistical tool called the Catoni estimator to achieve robustness even when the observed rewards have a large worst-case range but moderate variance. We provide two versions: one for when the reward variance is known in advance, and one that works even when it is not. Our methods achieve strong variance-dependent performance guarantees that depend only logarithmically on the worst-case reward range, meaning they are much more accurate and efficient in realistic scenarios. These results push the boundary of robust reinforcement learning and make it more practical for applications involving unreliable or extreme feedback, such as recommendation systems, finance, and networking.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Heavy-tailed Rewards, Contextual Bandits, General Function Approximation
Submission Number: 12097