Keywords: pure exploration, zero-sum games, mixed-strategy equilibrium, Frank–Wolfe, differential inclusion
TL;DR: We cast pure-exploration bandit problems as a two-player mixed-strategy game, and propose Frank–Wolfe Self-Play—simultaneous, one-hot, projection-free updates—with provable convergence via continuous-time Lyapunov (stochastic-approximation) analysis.
Abstract: We study pure exploration in structured bandits through a game-theoretic lens. For a broad class of queries, the asymptotic instance complexity admits a maximin form that corresponds to a two-player zero-sum game: an experimenter who allocates measurements to rule out alternatives versus a skeptic who proposes them. Allowing the skeptic to mix turns this into a concave–convex saddle-point problem. This reformulation unlocks a simple, projection-free, tuning-free, bandit-friendly algorithm: Frank–Wolfe Self-Play, in which both players take one-hot Frank–Wolfe steps against the other's empirical mixed strategy. Our analysis proceeds via continuous-time dynamics: the limiting differential inclusion admits a Lyapunov function that decays exponentially, yielding a vanishing duality gap and convergence of the game value. We then embed the discrete updates into a perturbed flow and, using stochastic-approximation techniques, show that the game value converges. A linear-bandit case study highlights phenomena absent in unstructured settings, including nonunique optima, optimal allocations with zero mass on the best arm, bilinear objectives, and boundary nonsmoothness, and shows how Frank–Wolfe Self-Play automatically discovers the correct solution where naive top-two/Thompson-style heuristics may fail. Overall, we offer a minimal modification of the classical Frank–Wolfe algorithm that is provably optimal for pure exploration.
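To make the simultaneous one-hot updates concrete, here is a minimal sketch of Frank–Wolfe Self-Play on a hypothetical bilinear zero-sum game over two simplices. The payoff matrix `A`, the problem dimensions, and the `1/(t+1)` step size are illustrative assumptions for exposition, not the paper's exact pure-exploration objective or tuning.

```python
# Minimal sketch of Frank-Wolfe Self-Play (assumed setup, not the paper's exact objective).
# Hypothetical instance: bilinear payoff f(w, q) = w^T A q on two probability simplices;
# the experimenter (w) maximizes, the skeptic (q) minimizes.
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_alts = 5, 7
A = rng.standard_normal((n_arms, n_alts))  # hypothetical payoff matrix

w = np.full(n_arms, 1.0 / n_arms)          # experimenter's empirical mixed strategy
q = np.full(n_alts, 1.0 / n_alts)          # skeptic's empirical mixed strategy

for t in range(1, 10_001):
    gamma = 1.0 / (t + 1)                  # assumed vanishing Frank-Wolfe step size
    # Each player linearizes the payoff at the opponent's empirical average
    # and steps toward the best vertex of its own simplex: a one-hot,
    # projection-free update.
    i = int(np.argmax(A @ q))              # experimenter's best-response arm
    j = int(np.argmin(w @ A))              # skeptic's best-response alternative
    w = (1 - gamma) * w + gamma * np.eye(n_arms)[i]
    q = (1 - gamma) * q + gamma * np.eye(n_alts)[j]

# The duality gap certifies convergence of the game value.
gap = (A @ q).max() - (w @ A).min()
print(f"duality gap after self-play: {gap:.4f}")
```

In this sketch, both updates use the same vanishing step, so each iterate is exactly the running empirical average of the one-hot plays; the printed duality gap shrinking toward zero mirrors the vanishing-gap guarantee described in the abstract.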
Submission Number: 95