TL;DR: We establish an improved online confidence bound for MNL models and, leveraging this result, propose a constant-time algorithm that achieves the tightest known variance-dependent regret in contextual MNL bandits.
Abstract: In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly minimax-optimal regret in MNL bandits. However, their results still depend on the norm bound $B$ of the unknown parameter and the maximum number of possible outcomes $K$. To address this, we first derive an online confidence bound of $\mathcal{O} (\sqrt{d \log t} + B )$, a significant improvement over the previous bound of $\mathcal{O} (B \sqrt{d} \log t \log K )$ (Lee & Oh, 2024). This is mainly achieved by establishing tighter self-concordant properties of the MNL loss and introducing a novel intermediary term to bound the estimation error. Using this new online confidence bound, we propose a constant-time algorithm, **OFU-MNL++**, which achieves a variance-dependent regret bound of $\mathcal{O} \Big( d \log T \sqrt{ \sum_{t=1}^T \sigma_t^2 } \Big) $ for sufficiently large $T$, where $\sigma_t^2$ denotes the variance of the rewards at round $t$, $d$ is the dimension of the contexts, and $T$ is the total number of rounds. Furthermore, we introduce a Maximum Likelihood Estimation (MLE)-based algorithm, **OFU-M$^2$NL**, which achieves an anytime, $\operatorname{poly}(B)$-free regret of $\mathcal{O} \Big( d \log (BT) \sqrt{ \sum_{t=1}^T \sigma_t^2 } \Big) $.
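For readers unfamiliar with the setting, the MNL model underlying these bandits assigns choice probabilities over an offered assortment via a softmax of linear utilities, with an outside (no-purchase) option of utility zero. Below is a minimal illustrative sketch of that choice model only; the function name and interface are hypothetical and not part of the paper's algorithms (OFU-MNL++ and OFU-M$^2$NL additionally maintain confidence sets around the estimated parameter).

```python
import math

def mnl_choice_probs(theta, contexts):
    """MNL choice probabilities for an assortment (illustrative sketch).

    theta    -- d-dimensional parameter vector (unknown in the bandit
                setting; estimated online by the learner).
    contexts -- list of d-dimensional feature vectors, one per offered item.

    Returns a list of probabilities: index 0 is the outside (no-choice)
    option, followed by one probability per item. This is a softmax over
    item utilities theta^T x, with the outside option fixed at utility 0.
    """
    utilities = [sum(t * x for t, x in zip(theta, ctx)) for ctx in contexts]
    # exp(0) = 1 accounts for the zero-utility outside option.
    denom = 1.0 + sum(math.exp(u) for u in utilities)
    return [1.0 / denom] + [math.exp(u) / denom for u in utilities]
```

With a zero parameter vector, every offered item and the outside option are equally likely, which is a quick sanity check on the normalization.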
Lay Summary: Many AI systems repeatedly face the challenge of choosing from multiple options — such as which product to recommend or which ad to show — while learning from user feedback to improve over time. This paper tackles that challenge and proposes a new algorithm that achieves state-of-the-art performance in both learning efficiency and computational speed.
Our method enables AI systems to make smarter, more confident decisions even under uncertainty. It introduces a more effective way to gauge how reliable the system’s knowledge is, while avoiding the heavy computations that previous methods require.
As a result, our algorithm is faster, more scalable, and easier to deploy in real-world applications. It helps AI systems learn quickly, use fewer resources, and make better choices — enabling more responsive and intelligent tools for recommendation, search, and personalized services.
Primary Area: Theory->Online Learning and Bandits
Keywords: Bandit, Multinomial Logistic Bandit, Confidence Bound, Regret
Submission Number: 10748