**Abstract:**Modern platforms leverage randomized experiments to make informed decisions from a given set of alternatives. As a particularly challenging scenario, these alternatives can potentially have (i) high volume, with thousands of new items being released each hour, and (ii) short lifetime, either due to the contents' transient nature, or some underlying non-stationarity that impels the learner to treat the same item as non-identical copies across time. We consider a multiplay bandits model. In each round a set of $k=n^\rho$ actions that will be available for $w$ rounds arrives, each of whose mean reward is drawn from a fixed known distribution. The learner selects a multiset of $n$ actions at a time. We propose an $\ell$-Layered Sieve Policy that recursively refines the action space for $\ell\leq w$ times. We show that for any given $\rho>0$, with suitable $\ell$, the policy achieves $\tilde O (n^{-\min \{\rho, \frac 12 (1+\frac 1w)^{-1}\}})$ regret. We also complement this result with an $\Omega (n^{-\min \{\rho, \frac 12\}})$ lower bound. We further validate the effectiveness of our Sieve Policy via numerical simulations and a field experiment in a large content card serving platform.

**Submission Number:**6522

Loading