Regret Bounds for Log-Loss via Bayesian Algorithms

Published: 01 Jan 2023, Last Modified: 04 Sep 2023. IEEE Trans. Inf. Theory, 2023.
Abstract: We study sequential probability assignment in the context of online learning under logarithmic loss and obtain tight lower and upper bounds for the sequential minimax regret. The sequential minimax regret is defined as the minimum excess loss over a data horizon $T$ that a predictor incurs relative to the best expert in a class, when the samples are presented sequentially and adversarially. Our upper bounds are established by applying Bayesian averaging over a novel “smooth truncated covering” of the expert class. This allows us to obtain tight (minimax) upper bounds that subsume the best known non-constructive bounds in an algorithmic fashion. For the lower bounds, we reduce the problem to analyzing the fixed-design regret via a novel application of the Shtarkov sum adapted to online learning. We demonstrate the effectiveness of our approach by establishing tight regret bounds for a wide range of expert classes. In particular, we fully characterize the regret of generalized linear functions with worst-case Lipschitz transform functions when the parameters are restricted to a unit-norm $\ell_{s}$ ($s \ge 2$) ball of dimension $d$. We show that the regret grows as $\Theta(d \log T)$ when $d \le O(T^{s/(s+1)-\epsilon})$ for all $\epsilon > 0$ (with precise constant 1 when $d \le e^{o(\log T)}$), and as $\tilde{O}(T^{s/(s+1)})$ when $d \ge \Omega(T^{s/(s+1)})$. Finally, we show that the Bayesian approach may not always be optimal if the support of the prior is contained in the reference class itself.
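The upper-bound technique is Bayesian averaging: predict with a prior-weighted mixture of experts and update the weights by Bayes' rule after each observed outcome. Below is a minimal illustrative sketch for a finite expert set standing in for the paper's smooth truncated covering (whose construction is not reproduced here); the uniform prior and the `expert_probs` array, giving each expert's predicted probability of the realized outcome at every round, are assumptions made purely for illustration, not the paper's algorithm.

```python
# A minimal sketch of Bayesian averaging over a finite set of experts under
# logarithmic loss. Here the paper's "smooth truncated covering" is replaced
# by a hypothetical pre-built finite cover; expert_probs[i][t] is assumed to
# be expert i's predicted probability of the realized outcome y_t at round t.
import numpy as np

def bayesian_mixture_log_loss(expert_probs):
    """Run the Bayesian mixture predictor and return its cumulative log loss
    together with the cumulative log loss of the best single expert."""
    expert_probs = np.asarray(expert_probs, dtype=float)  # shape (n_experts, T)
    n_experts, T = expert_probs.shape
    log_weights = np.full(n_experts, -np.log(n_experts))  # uniform prior
    mixture_loss = 0.0
    for t in range(T):
        # Mixture probability assigned to the realized outcome y_t.
        p_mix = np.exp(log_weights) @ expert_probs[:, t]
        mixture_loss += -np.log(p_mix)
        # Bayesian posterior update: multiply each weight by the expert's
        # likelihood of y_t, then renormalize in log space.
        log_weights += np.log(expert_probs[:, t])
        log_weights -= np.logaddexp.reduce(log_weights)
    best_expert_loss = np.min(-np.log(expert_probs).sum(axis=1))
    return mixture_loss, best_expert_loss
```

With a uniform prior over $N$ experts, the mixture's cumulative log loss exceeds that of the best expert by at most $\log N$; roughly speaking, the paper's contribution is a covering for which this kind of gap matches the sequential minimax regret.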