For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent
Abstract: We study the problem of multiclass online learning under $\textit{bandit feedback}$ within the framework of $\textit{universal learning}$ [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC '21].
In multiclass online learning under bandit feedback, it is well known that no concept class $\mathcal{C}$ is $\textit{uniformly}$ learnable when the effective label space is unbounded; in other words, no online learner guarantees a finite bound on the expected number of mistakes that holds uniformly over all realizable data sequences. In contrast, and perhaps surprisingly, we show that for $\textit{universal}$ learnability of concept classes $\mathcal{C}$, there is an exact equivalence between multiclass online learnability under bandit feedback and under full supervision, in both the realizable and agnostic settings.
More specifically, our first main contribution is a theory that establishes an inherent dichotomy in multiclass online learning under bandit feedback within the realizable setting. In particular, for any concept class $\mathcal{C}$, even when the effective label space is unbounded, we have: (1) If $\mathcal{C}$ does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without a uniform bound on the number of mistakes. (2) If $\mathcal{C}$ has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized learners, to make an expected number of mistakes that grows linearly with the number of rounds. Furthermore, our second main contribution reveals a similar dichotomy in the agnostic setting.
Submission Number: 124