Track: Research Track
Keywords: Offline Bandits, Adversarial Attacks
Abstract: Bandit algorithms have recently become a popular tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model—often distributed with public weights on platforms such as Hugging Face—to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has emerged as an attractive alternative. However, the adversarial robustness of such offline bandit evaluation remains largely unexplored.
We investigate the vulnerability of offline bandit training to adversarial perturbations of the reward model. We introduce a novel threat model in which an attacker manipulates the reward function, using only offline data in high-dimensional settings, to hijack the bandit’s behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we show that even small, imperceptible perturbations of the reward model’s weights can drastically change which arms the algorithm selects. Our analysis reveals a striking high-dimensional effect: as the input dimension grows, the perturbation norm required for a successful attack decreases, making modern applications (e.g., image evaluation) especially vulnerable.
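As a back-of-the-envelope illustration of this effect (a sketch under simplified assumptions, not the formal analysis in the paper): for a linear reward $\hat r(x) = \theta^\top x$ and two arms $x_1, x_2$ with $\theta^\top x_1 > \theta^\top x_2$, the smallest $\ell_2$ weight perturbation $\delta$ that makes arm 2 preferred is aligned with $x_2 - x_1$ and has norm
$$\|\delta\|_2 = \frac{\theta^\top (x_1 - x_2)}{\|x_1 - x_2\|_2}.$$
If the arm features are roughly isotropic unit vectors in $\mathbb{R}^d$, this reward gap concentrates at order $\|\theta\|_2/\sqrt{d}$, so the attack budget shrinks as the dimension $d$ grows.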
Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted ones succeed with near-perfect attack success rates. To address the computational cost of finding such perturbations, we propose efficient heuristics that retain almost 100% success while dramatically reducing attack cost. Finally, we validate our approach on the UCB bandit and provide theoretical evidence that an adversary can delay selection of the optimal arm in proportion to the input dimension.
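To make the targeted-versus-random comparison concrete, below is a minimal, self-contained toy sketch (our own illustrative setup, not the paper’s code or its actual attack): it computes the smallest weight perturbation that promotes the runner-up arm of a linear reward model and contrasts it with a random perturbation of the same norm.

```python
import numpy as np

# Toy sketch: targeted vs. random perturbation of a linear reward model's weights.
# Illustrative assumptions only; this is not the paper's attack implementation.
rng = np.random.default_rng(0)
d, n_arms, trials = 512, 20, 200
targeted_flips = random_flips = 0

for _ in range(trials):
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)                    # reward-model weights, ||theta|| = 1
    X = rng.normal(size=(n_arms, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)     # arm features on the unit sphere

    scores = X @ theta
    best, target = np.argsort(scores)[-1], np.argsort(scores)[-2]

    # Smallest L2 perturbation that makes `target` score above `best`:
    # delta is aligned with (x_target - x_best), plus a tiny margin for strictness.
    diff = X[target] - X[best]
    gap = scores[best] - scores[target]
    delta = (gap / diff.dot(diff) + 1e-6) * diff

    targeted_flips += np.argmax(X @ (theta + delta)) == target

    # Random perturbation of equal norm as a baseline.
    rand = rng.normal(size=d)
    rand *= np.linalg.norm(delta) / np.linalg.norm(rand)
    random_flips += np.argmax(X @ (theta + rand)) != best

print(f"targeted success rate: {targeted_flips / trials:.2f}")
print(f"random change rate:    {random_flips / trials:.2f}")
print(f"last perturbation norm: {np.linalg.norm(delta):.3f} (vs. ||theta|| = 1)")
```

In this toy setting the targeted perturbation reliably changes the selected arm while a random perturbation of identical norm rarely does, mirroring the qualitative contrast reported above.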
Submission Number: 82