Keywords: pure exploration, (combinatorial) multi-armed bandit, Thompson Sampling
Abstract: Pure exploration plays an important role in online learning. Existing work mainly focuses on the UCB approach, which uses confidence bounds of all the arms to decide which one is optimal. However, the UCB approach faces challenges when looking for the best arm set under certain combinatorial structures: it uses the sum of upper confidence bounds of the arms within a set $S$ to judge whether $S$ is optimal, and this sum can be much larger than the exact upper confidence bound of $S$, since the empirical means of different arms in $S$ deviate independently and are unlikely to reach their individual confidence bounds simultaneously. As a result, the UCB approach requires much higher sample complexity than necessary. To address this challenge, we explore the idea of Thompson Sampling (TS), which uses independent random samples instead of upper confidence bounds to make decisions, and design the first TS-based algorithmic framework, TS-Verify, for (combinatorial) pure exploration. In TS-Verify, the sum of independent random samples within an arm set $S$ does not exceed the exact upper confidence bound of $S$ with high probability. Hence it resolves the above challenge and performs better than existing UCB-based algorithms in the general combinatorial pure exploration setting. For pure exploration in the classic multi-armed bandit, we show that TS-Verify achieves an asymptotically optimal complexity upper bound.
One-sentence Summary: This paper studies applying the Thompson Sampling approach to (combinatorial) pure exploration problems under the frequentist setting.
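The gap the abstract points to can be seen numerically. Below is a minimal sketch (not the paper's TS-Verify algorithm) that compares, for a fixed arm set $S$, the sum of per-arm upper confidence bounds against the sum of independent Gaussian samples centered at the empirical means. All names, the Hoeffding-style radius, and the Gaussian sampling distribution are illustrative assumptions, not the authors' construction.

```python
# Illustrative sketch: sum of per-arm UCBs vs. sum of independent samples
# over an arm set S. Parameters and distributions are assumptions for
# illustration only; this is not the TS-Verify algorithm from the paper.
import numpy as np

rng = np.random.default_rng(0)
K = 20                       # number of arms in the set S
n = 100                      # pulls per arm
true_means = rng.uniform(0.3, 0.7, size=K)

# Empirical means from n Bernoulli pulls of each arm.
emp_means = rng.binomial(n, true_means) / n

# UCB-style estimate: add a confidence radius to every arm, then sum over S.
delta = 0.05
radius = np.sqrt(np.log(1 / delta) / (2 * n))   # Hoeffding-style radius
ucb_sum = np.sum(emp_means + radius)            # inflated by K * radius

# TS-style estimate: one independent Gaussian sample per arm, then sum.
samples = rng.normal(emp_means, np.sqrt(1.0 / (4 * n)))
ts_sum = np.sum(samples)                        # fluctuates only by ~sqrt(K) * radius

print(f"sum of true means : {true_means.sum():.3f}")
print(f"sum of UCBs       : {ucb_sum:.3f}")
print(f"sum of TS samples : {ts_sum:.3f}")
```

Because independent deviations add up at rate $\sqrt{K}$ rather than $K$, the sampled sum stays close to the exact upper confidence bound of $S$, which is the concentration effect the abstract attributes to the TS-based approach.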