Multinomial Logit Contextual BanditsDownload PDF

05 May 2019 (modified: 05 May 2023)RL4RealLife 2019Readers: Everyone
Keywords: contextual bandit, upper-confidence bound, dynamic assortment selection
Abstract: We consider a dynamic assortment selection problem where the goal is to offer an assortment with cardinality constraint $K$ from a set of $N$ possible items. The sequence of assortments can be chosen as a function of the contextual information of items, and possibly users, and the goal is to maximize the expected cumulative rewards, or alternatively, minimize the expected regret. The distinguishing feature in our work is that feedback, i.e. the item chosen by the user, has a multinomial logistic distribution. We propose upper confidence interval based algorithms for this multinomial logit contextual bandit. The first algorithm is a simple and computationally more efficient method which achieves $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds with $d$ dimensional feature vectors. The second algorithm inspired by the work of \cite{li2017provably} achieves an $\tilde{O}(\sqrt{dT})$ with logarithmic dependence on $N$ and increased computational complexity because of pruning processes.
0 Replies

Loading