Multi-play Multi-armed Bandit Model with Scarce Sharable Arm Capacities

28 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Multi-play multi-armed bandit, scarce sharable arm capacity, regret bounds
Abstract: This paper revisits the multi-play multi-armed bandit with shareable arm capacities (MP-MAB-SAC) problem, with the aim of revealing fundamental insights into its statistical limits and data-efficient learning. MP-MAB-SAC is tailored to resource allocation problems arising in LLM inference serving, edge intelligence, etc. It consists of $K$ arms, where each arm $k$ is associated with an unknown but deterministic capacity $m_k$ and a per-unit-capacity reward with mean $\mu_k$ and $\sigma$ sub-Gaussian noise. The aggregate reward mean of an arm scales linearly with the number of plays assigned to it until the number of plays hits the capacity limit $m_k$, after which the aggregate reward mean is fixed at $m_k \mu_k$. At each round only the aggregate reward is revealed to the learner. Our contributions are threefold. 1) \textit{Sample complexity:} we prove a minimax lower bound of $\Omega(\frac{\sigma^2}{\mu^2_k} \log \delta^{-1})$ on the sample complexity of learning the arm capacity, and propose an algorithm that exactly matches this lower bound. This result closes the sample complexity gap of Wang et al. (2022a), whose lower and upper bounds are $\Omega(\log \delta^{-1})$ and $O(\frac{m^2_k \sigma^2}{\mu^2_k} \log \delta^{-1})$, respectively. 2) \textit{Regret lower bounds:} we prove an instance-independent regret lower bound of $\Omega(\sigma \sqrt{TK})$ and an instance-dependent regret lower bound of $\Omega(\sum_{k=1}^K \frac{c\sigma^2}{\mu_k^2} \log T)$. This provides the first instance-independent regret lower bound and strengthens the instance-dependent regret lower bound $\Omega(\sum_{k=1}^K \log T)$ of Wang et al. (2022a). 3) \textit{Data-efficient exploration:} we propose an algorithm named \texttt{PC-CapUL}, which uses prioritized coordination of upper/lower confidence bounds (UCB/LCB) on arm capacities to efficiently balance the exploration vs. exploitation trade-off.
We prove both instance-dependent and instance-independent upper bounds for \texttt{PC-CapUL}, which match the lower bounds up to acceptable model-dependent factors. This result provides the first instance-independent upper bound, and achieves the same dependency on $m_k$ and $\mu_k$ as the instance-dependent upper bound of Wang et al. (2022a), even though less information about the arm capacity is available in our aggregate-reward setting. Numerical experiments validate the data efficiency of \texttt{PC-CapUL}.
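To make the feedback model concrete, here is a minimal sketch of the aggregate-reward observation described in the abstract: the mean reward grows linearly in the number of plays until it saturates at $m_k \mu_k$, and only the noisy aggregate is observed. The function name `aggregate_reward` is illustrative (not from the paper), and Gaussian noise stands in for the $\sigma$ sub-Gaussian noise.

```python
import numpy as np

def aggregate_reward(plays, m_k, mu_k, sigma, rng=np.random.default_rng(0)):
    """Simulate one round's aggregate reward from an arm.

    The mean scales linearly with the number of plays assigned to the
    arm until it hits the capacity m_k, after which it is fixed at
    m_k * mu_k. Only this noisy aggregate is revealed to the learner.
    """
    effective_plays = min(plays, m_k)  # reward saturates at the capacity
    return effective_plays * mu_k + sigma * rng.standard_normal()
```

For example, with `m_k = 3` and `mu_k = 1.5`, assigning 2 plays yields mean reward `3.0`, while assigning 5 plays yields the saturated mean `4.5` — the learner cannot tell from a single noisy observation whether extra plays beyond the capacity were wasted, which is what makes learning $m_k$ from aggregate feedback nontrivial.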
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14140
