Adversarial Combinatorial Semi-bandits with Graph Feedback

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We consider combinatorial bandits with graph feedback among the arms and prove a tight regret characterization that interpolates nicely between known extreme cases.
Abstract: In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include \emph{graph feedback}, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of each combinatorial decision and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde\Theta(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde\Theta(\sqrt{KST})$ under semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the total number of arms. A key technical ingredient is to realize a convexified action using a random decision vector with negative correlations. We also show that online stochastic mirror descent (OSMD) that only realizes convexified actions in expectation is suboptimal.
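The abstract's key ingredient is a random decision vector that realizes a convexified (fractional) action exactly, with negatively correlated coordinates. The paper's own construction is not reproduced on this page; as a point of reference, below is a minimal sketch of a standard dependent-rounding (pivotal sampling) scheme that turns a fractional point $x$ with integer sum $S$ into a size-$S$ subset whose inclusion indicators match $x$ coordinate-wise in expectation and are known to be negatively correlated. The function name and numerical tolerances are illustrative, not taken from the paper.

```python
import random

def dependent_rounding(x, rng=random.Random()):
    """Round a fractional vector x (entries in [0,1], sum an integer S)
    to a 0/1 vector with exactly S ones, matching x coordinate-wise in
    expectation.  Pairwise pivotal steps of this kind are known to yield
    negatively correlated inclusion indicators."""
    x = list(map(float, x))
    frac = [i for i, v in enumerate(x) if 1e-12 < v < 1 - 1e-12]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        d1 = min(1 - x[i], x[j])   # mass we can shift from j onto i
        d2 = min(x[i], 1 - x[j])   # mass we can shift from i onto j
        if rng.random() < d2 / (d1 + d2):
            x[i] += d1; x[j] -= d1   # happens w.p. d2/(d1+d2), so E[x] is preserved
        else:
            x[i] -= d2; x[j] += d2
        # keep only coordinates that are still strictly fractional
        frac = [k for k in frac if 1e-12 < x[k] < 1 - 1e-12]
    return [int(round(v)) for v in x]

# Example: select S = 2 arms out of K = 4 with marginals summing to 2.
p = [0.9, 0.6, 0.3, 0.2]
picks = dependent_rounding(p)
assert sum(picks) == 2
```

Each pivotal step fixes at least one coordinate to 0 or 1 while preserving both the expectation and the total mass, so the procedure terminates with exactly $S$ ones.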
Lay Summary: In many scenarios, we make decisions while learning the factors that affect our preferences or gains; this is called online learning. We want to learn about those factors, but we also want to gain as much as possible during the process. For example, suppose we have a few lotteries (numbered 1 to K) with different winning probabilities but the same price, and we can draw any one of them every day. To earn as much money as possible (say, over one year), we need to figure out which one's winning probability is the highest as we go. This type of problem is studied under the name multi-armed bandits. In this work, we study a variant in which we can now draw several of them (say S) every day. How do we maximize our gain? More importantly, if there is additional information, how can we leverage it? For example, imagine that whenever you buy lottery k, your rich friend always buys every lottery with a smaller number, namely 1 to k-1, and tells you her outcomes as well. This information structure forms a graph over the candidate actions (in this case, the lotteries we can purchase) and is called graph feedback. In this work, we mathematically characterize the performance guarantees of any policy/strategy you can use to maximize your total gain in the worst-case scenario. These are the so-called minimax regret bounds. We show how the worst-case performance guarantees relate to the graph structure, and we provide an optimal (in the worst-case sense) algorithm to achieve them. For any online decision-making process, such as bidding in advertising, online inventory control, or recommendation systems, you may use this algorithm to guarantee your total gain.
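To connect the lay example to the $\alpha$ in the regret bound: in the lottery example, buying lottery k additionally reveals lotteries 1 to k-1, so every pair of arms is connected and the independence number is 1, which puts us in the full-information-like regime $\widetilde{\Theta}(S\sqrt{T})$; with only self-loops, no pair is connected and $\alpha = K$. Below is a small brute-force check of this, under the usual convention that a set of arms is independent when no edge joins two of its members in either direction; the graph encoding and function name here are illustrative, not from the paper.

```python
from itertools import combinations

def independence_number(K, edges):
    """Brute-force the largest set of arms with no edge (in either direction)
    between any two of its members; feasible only for small K."""
    und = {frozenset(e) for e in edges if e[0] != e[1]}  # ignore self-loops
    for size in range(K, 0, -1):
        for subset in combinations(range(K), size):
            if all(frozenset(p) not in und for p in combinations(subset, 2)):
                return size
    return 0

K = 5
chain = [(k, j) for k in range(K) for j in range(k + 1)]   # arm k reveals arms 0..k
self_loops = [(k, k) for k in range(K)]                    # plain semi-bandit feedback
print(independence_number(K, chain))       # 1 -> full-information-like regime
print(independence_number(K, self_loops))  # 5 -> alpha = K regime
```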
Primary Area: General Machine Learning->Online Learning, Active Learning and Bandits
Keywords: Combinatorial bandits, graph feedback, semi-bandit, adversarial bandits, statistical learning
Submission Number: 3302