Geometric Resampling in Nearly Linear Time for Follow-the-Perturbed-Leader with Best-of-Both-Worlds Guarantee in Bandit Problems
TL;DR: The authors propose a novel resampling technique with which the Follow-the-Perturbed-Leader policy achieves nearly linear per-round time complexity without sacrificing its Best-of-Both-Worlds regret guarantee in bandit problems.
Abstract: This paper studies the complexity and optimality of the Follow-the-Perturbed-Leader (FTPL) policy in $K$-armed bandit problems. FTPL is a promising policy that achieves the Best-of-Both-Worlds (BOBW) guarantee without solving an optimization problem in each round, unlike Follow-the-Regularized-Leader (FTRL). However, FTPL relies on a procedure called geometric resampling to estimate the losses, which requires $O(K^2)$ average time per round, usually worse than the complexity of FTRL. To address this issue, we propose a novel technique, which we call Conditional Geometric Resampling (CGR), for unbiased loss estimation under general perturbation distributions. CGR reduces the average per-round complexity to $O(K \log K)$ without sacrificing the regret bounds. We also propose a biased version of CGR that can control the worst-case complexity while keeping the BOBW guarantee for a certain perturbation distribution. We confirm through experiments that CGR not only significantly improves the average and worst-case runtime but also achieves better regret thanks to more stable loss estimation.
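For readers unfamiliar with the baseline that the paper improves on, here is a minimal Python sketch of standard geometric resampling for FTPL (Neu & Bartók, 2013), the $O(K^2)$ procedure mentioned in the abstract; it is not the paper's CGR. The exponential perturbation distribution and the truncation parameter `cap` are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

def perturbed_leader(cum_loss_est, rng):
    """Pick the arm minimizing estimated cumulative loss minus a fresh perturbation.
    Exponential noise is an illustrative assumption; BOBW FTPL typically uses
    heavier-tailed (e.g. Frechet-type) perturbations."""
    perturbation = rng.standard_exponential(cum_loss_est.shape)
    return int(np.argmin(cum_loss_est - perturbation))

def geometric_resampling_estimate(cum_loss_est, chosen_arm, observed_loss, rng, cap=None):
    """Standard geometric resampling: redraw perturbations until the perturbed
    leader matches the arm actually played. The redraw count M is geometric
    with mean 1 / p(chosen_arm), so M * observed_loss is an unbiased estimate
    of observed_loss / p(chosen_arm). Truncating at `cap` bounds the worst-case
    cost at the price of a controlled bias."""
    m = 1
    while perturbed_leader(cum_loss_est, rng) != chosen_arm:
        m += 1
        if cap is not None and m >= cap:
            break
    est = np.zeros_like(cum_loss_est)
    est[chosen_arm] = m * observed_loss
    return est

# Example usage for one round with K = 5 arms (placeholder loss):
rng = np.random.default_rng(0)
K = 5
cum_loss_est = np.zeros(K)
arm = perturbed_leader(cum_loss_est, rng)
loss = rng.uniform()
cum_loss_est += geometric_resampling_estimate(cum_loss_est, arm, loss, rng, cap=10 * K)
```

Each redraw costs $O(K)$ to recompute the argmin, and averaging the geometric redraw count over the arm actually drawn gives $\sum_a p_a \cdot (1/p_a) = K$ expected redraws, which is where the $O(K^2)$ average per-round cost cited in the abstract comes from. CGR's $O(K \log K)$ improvement is not sketched here, as its details are the paper's contribution.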
Lay Summary: Imagine you're playing a game with several slot machines, each giving different rewards. Your goal is to win as much as possible, but you don't know which machine is the best at first, and you have limited chances to find out. You need a smart strategy to balance learning about the machines and earning rewards. This classic dilemma shows up in many real-world settings like online recommendations and clinical trials.
In this paper, we propose a new method that helps computers make better decisions in such uncertain situations. It speeds up the learning process by reducing the need for repeated trial and error, while still making accurate choices. Our approach works well even when the environment changes over time and is much faster than existing methods, making it more practical for real-world use.
Link To Code: https://github.com/BotaoChen123/FTPL-CGR
Primary Area: Theory->Online Learning and Bandits
Keywords: multi-armed bandit, follow-the-perturbed-leader, best-of-both-worlds, geometric resampling
Submission Number: 4589