Multinomial Logit Contextual Bandits

May 05, 2019 Submission readers: everyone
  • Keywords: contextual bandit, upper-confidence bound, dynamic assortment selection
  • Abstract: We consider a dynamic assortment selection problem where the goal is to offer an assortment with cardinality constraint $K$ from a set of $N$ possible items. The sequence of assortments can be chosen as a function of the contextual information of items, and possibly users, and the goal is to maximize the expected cumulative rewards, or alternatively, minimize the expected regret. The distinguishing feature in our work is that feedback, i.e. the item chosen by the user, has a multinomial logistic distribution. We propose upper confidence interval based algorithms for this multinomial logit contextual bandit. The first algorithm is a simple and computationally more efficient method which achieves $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds with $d$ dimensional feature vectors. The second algorithm inspired by the work of \cite{li2017provably} achieves an $\tilde{O}(\sqrt{dT})$ with logarithmic dependence on $N$ and increased computational complexity because of pruning processes.
0 Replies