Abstract: We consider an online assortment optimization problem, where in every round, the retailer offers a K-cardinality subset (assortment) of N substitutable products to a consumer, and observes the response. We model consumer choice behavior using the widely used multinomial logit (MNL) model, and consider the retailer's problem of dynamically learning the model parameters, while optimizing cumulative revenues over the selling horizon T. Formulating this as a variant of a multi-armed bandit problem, we present an algorithm based on the principle of "optimism in the face of uncertainty." A naive MAB formulation would treat each of the N choose K possible assortments as a distinct "arm", leading to regret bounds that are exponential in K. We show that by exploiting the specific characteristics of the MNL model it is possible to design an algorithm with Õ(√NT) regret, under a mild assumption. We demonstrate that this performance is nearly optimal, by providing a (randomized) instance of this problem on which any online algorithm would incur at least Ω(√NT/K) regret.
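To make the setting concrete, the following is a minimal sketch of the MNL choice model and of the naive view that treats every K-item assortment as its own arm. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the preference weights `v`, the revenues `r`, and the function names are hypothetical, the parameters are taken as known rather than learned, and a no-purchase option with weight normalized to 1 is assumed.

```python
from itertools import combinations
from math import comb

# Hypothetical instance: N = 4 products, assortments of size K = 2.
v = [0.5, 1.0, 0.8, 0.3]   # assumed MNL preference weights (unknown to the retailer in the paper)
r = [1.0, 0.6, 0.8, 0.9]   # assumed per-product revenues
K = 2


def mnl_choice_probs(assortment, v):
    """Return {product: purchase probability} for the offered assortment.

    The key None stands for the no-purchase option, whose weight is
    normalized to 1 (an assumption of this sketch).
    """
    denom = 1.0 + sum(v[i] for i in assortment)
    probs = {i: v[i] / denom for i in assortment}
    probs[None] = 1.0 / denom
    return probs


def expected_revenue(assortment, v, r):
    """Expected revenue of one round when `assortment` is offered."""
    probs = mnl_choice_probs(assortment, v)
    return sum(r[i] * probs[i] for i in assortment)


def best_assortment(v, r, K):
    """Brute-force search over all (N choose K) assortments of size K.

    This is the naive 'one arm per assortment' view; the number of arms
    grows combinatorially in K.
    """
    candidates = combinations(range(len(v)), K)
    best = max(candidates, key=lambda s: expected_revenue(s, v, r))
    return best, expected_revenue(best, v, r)


num_arms = comb(len(v), K)          # number of arms in the naive formulation
best_set, best_rev = best_assortment(v, r, K)
```

The enumeration is feasible only for tiny instances; `comb(N, K)` grows quickly with K, which is the blow-up the abstract attributes to the naive formulation.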