Contextual Linear Bandits with Delay as Payoff

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We extend the delay-as-payoff model from stochastic multi-armed bandits to contextual linear bandits, proposing a novel phased arm elimination algorithm with strong theoretical guarantees.
Abstract: A recent work by Schlisselberg et al. (2025) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is only of order $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When the payoff is a loss, we further improve this bound, demonstrating a separation between reward and loss similar to that of Schlisselberg et al. (2025). In contrast to standard linear bandit algorithms that construct a least squares estimator and confidence ellipsoid, the main novelty of our algorithm is a phased arm elimination procedure that picks arms only from **volumetric spanners** of the action set, which addresses the challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
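To give a concrete feel for the phased-elimination idea described above, here is a minimal Python sketch. It is not the paper's algorithm: it ignores the delay-as-payoff mechanics and the loss/reward asymmetry, and it substitutes a greedy D-optimal (log-det) selection for the volumetric spanner construction. The names `greedy_spanner`, `phased_elimination`, and `pull` are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def greedy_spanner(arms, d):
    """Greedily pick d arms, each maximizing the log-det gain of the
    design matrix. A simple stand-in for a volumetric spanner."""
    V = 1e-6 * np.eye(d)          # small ridge so slogdet is defined
    chosen = []
    for _ in range(d):
        gains = [np.linalg.slogdet(V + np.outer(x, x))[1] for x in arms]
        j = int(np.argmax(gains))
        chosen.append(j)
        V += np.outer(arms[j], arms[j])
    return chosen

def phased_elimination(arms, pull, T, delta=0.05):
    """Generic phased arm elimination for linear bandits (sketch only).
    `pull(i)` returns a noisy reward for global arm index i."""
    n_arms, d = arms.shape
    active = np.arange(n_arms)
    t, phase = 0, 1
    while t < T and len(active) > 1:
        eps = 2.0 ** (-phase)                     # target accuracy this phase
        span = greedy_spanner(arms[active], d)    # exploration basis
        n = int(np.ceil(2 * d * np.log(2 / delta) / eps ** 2))
        V, b = 1e-6 * np.eye(d), np.zeros(d)
        for i in span:                            # pull only the spanner arms
            x = arms[active[i]]
            for _ in range(n):
                if t >= T:
                    break
                r = pull(active[i])
                t += 1
                V += np.outer(x, x)
                b += r * x
        theta = np.linalg.solve(V, b)             # least squares on phase data
        est = arms[active] @ theta                # estimated mean rewards
        active = active[est >= est.max() - 2 * eps]  # keep near-optimal arms
        phase += 1
    return active

# Toy usage: 50 unit-norm arms in R^5, linear rewards with Gaussian noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
theta_star = rng.normal(size=5)
survivors = phased_elimination(
    A, lambda i: A[i] @ theta_star + 0.1 * rng.normal(), T=20000)
```

The key design point this sketch shares with the paper's approach is that exploration within a phase is restricted to a small spanning subset of arms rather than driven by per-round confidence ellipsoids, which is what makes the scheme amenable to large action sets.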
Lay Summary: Sequential decision-making applications often need to deal with delayed action outcomes: think of medical treatments and online advertising, where the payoff is not immediate and its delay is proportional to the payoff. A recent study [Schlisselberg et al. 2025] explored this problem, but only for very basic multi-armed bandit scenarios. In our work, we tackle its contextual linear variant, where each action's payoff depends on its time-varying feature embedding, making the problem more realistic. To address this, we introduce a novel and efficient algorithm with strong theoretical and empirical guarantees. In contrast to standard linear bandit algorithms that construct least squares estimators and confidence ellipsoids, the main novelty of our algorithm is its phased arm elimination procedure, which selects arms only from the volumetric spanners of the action set. This approach effectively addresses challenges arising from both payoff-dependent delays and large action sets. Our research initiates the study of contextual linear bandits with payoff-dependent delays, opening doors to more complicated real-world scenarios, including evolving and composite delayed feedback.
Primary Area: Theory->Online Learning and Bandits
Keywords: online learning, delayed feedback, linear bandits
Submission Number: 8852