Bandit and Delayed Feedback in Online Structured Prediction

Published: 23 Sept 2025, Last Modified: 01 Dec 2025ARLETEveryoneRevisionsBibTeXCC BY 4.0
Track: Research Track
Keywords: online learning, structured prediction, bandit feedback, delayed feedback
Abstract: Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the *surrogate regret*, *i.e.,* the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, *bandit* and *delayed* feedback. For non-delayed bandit feedback, by using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare these algorithms with existing ones.Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.
Submission Number: 133
Loading