Optimal Trade-offs between Regret and Estimation in Capacitated Multinomial Logit Bandits

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multinomial Logit Bandits, Pareto Optimality, Assortment Optimization
Abstract: Online decision-making involves a fundamental trade-off between two objectives. The first is *regret minimization*, which aims to maximize cumulative reward; the second is *parameter estimation*, which aims to learn the underlying model for downstream tasks. While this trade-off is well studied in multi-armed bandits (MAB), it remains far less understood in multinomial logit (MNL) bandits, where the decision space is combinatorially large. The only prior work, Zuo & Qin (2025), is limited to the uncapacitated case and lacks a tight characterization of the dependence on the number of items $N$. In this work, we establish tight trade-off bounds between regret and customer-attraction estimation error for capacitated MNL bandits, with a sharp dependence on $N$. To match these bounds, we introduce an algorithm that achieves the optimal trade-off, providing the first complete characterization of *Pareto optimality* in this setting. The lower-bound technique underlying our results is broadly applicable and also strengthens existing results for MAB. Beyond attraction estimation, our analysis further extends to customer-preference estimation error, where the same guarantees continue to hold. As a further application, our framework addresses the joint assortment and pricing problem, yielding new insights into the regret-estimation trade-off in broader contexts.
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 9261