TL;DR: We establish an improved online confidence bound for MNL models and, leveraging this result, propose a constant-time algorithm that achieves the tightest known variance-dependent regret in contextual MNL bandits.
Abstract: In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly minimax-optimal regret in MNL bandits. However, their results still depend on the norm bound $B$ of the unknown parameter and the maximum number of possible outcomes $K$. To address this, we first derive an online confidence bound of $\mathcal{O} (\sqrt{d \log t} + B )$, a significant improvement over the previous bound of $\mathcal{O} (B \sqrt{d} \log t \log K )$ (Lee & Oh, 2024). This is mainly achieved by establishing tighter self-concordant properties of the MNL loss and introducing a novel intermediary term to bound the estimation error. Using this new online confidence bound, we propose a constant-time algorithm, **OFU-MNL++**, which achieves a variance-dependent regret bound of $\mathcal{O} \Big( d \log T \sqrt{ \sum_{t=1}^T \sigma_t^2 } \Big) $ for sufficiently large $T$, where $\sigma_t^2$ denotes the variance of the rewards at round $t$, $d$ is the dimension of the contexts, and $T$ is the total number of rounds. Furthermore, we introduce a Maximum Likelihood Estimation (MLE)-based algorithm, **OFU-M$^2$NL**, which achieves an anytime, $\operatorname{poly}(B)$-free regret of $\mathcal{O} \Big( d \log (BT) \sqrt{ \sum_{t=1}^T \sigma_t^2 } \Big) $.
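For readers unfamiliar with the setting, the MNL model underlying these bandits assigns choice probabilities over an offered assortment via a softmax of linear utilities, with an outside (no-purchase) option of utility zero. Below is a minimal illustrative sketch of that choice model only; the function name and interface are hypothetical and not part of the paper's algorithms (OFU-MNL++ and OFU-M$^2$NL additionally maintain confidence sets around the estimated parameter).

```python
import math

def mnl_choice_probs(theta, contexts):
    """MNL choice probabilities for an assortment (illustrative sketch).

    theta    -- d-dimensional parameter vector (unknown in the bandit
                setting; estimated online by the learner).
    contexts -- list of d-dimensional feature vectors, one per offered item.

    Returns a list of probabilities: index 0 is the outside (no-choice)
    option, followed by one probability per item. This is a softmax over
    item utilities theta^T x, with the outside option fixed at utility 0.
    """
    utilities = [sum(t * x for t, x in zip(theta, ctx)) for ctx in contexts]
    # exp(0) = 1 accounts for the zero-utility outside option.
    denom = 1.0 + sum(math.exp(u) for u in utilities)
    return [1.0 / denom] + [math.exp(u) / denom for u in utilities]
```

With a zero parameter vector, every offered item and the outside option are equally likely, which is a quick sanity check on the normalization.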
Lay Summary: Many AI systems repeatedly face the challenge of choosing from multiple options — such as which product to recommend or which ad to show — while learning from user feedback to improve over time. This paper tackles that challenge and proposes a new algorithm that achieves state-of-the-art performance in both learning efficiency and computational speed.
Our method enables AI systems to make smarter, more confident decisions even under uncertainty. It introduces a more effective way to gauge how reliable the system’s knowledge is, while avoiding the heavy computations that previous methods require.
As a result, our algorithm is faster, more scalable, and easier to deploy in real-world applications. It helps AI systems learn quickly, use fewer resources, and make better choices — enabling more responsive and intelligent tools for recommendation, search, and personalized services.
Primary Area: Theory->Online Learning and Bandits
Keywords: Bandit, Multinomial Logistic Bandit, Confidence Bound, Regret
Submission Number: 10748