Abstract: In this paper, we study a dynamic assortment optimization problem under bandit feedback, where a seller with a fixed initial inventory of N substitutable products faces a sequence of i.i.d. customer arrivals (drawn from an unknown distribution) over a horizon of T periods and must decide, in each period, which assortment of products to offer to the arriving customer so as to maximize the total expected revenue. Such problems arise in many applications, including online retail and recommendation systems. The seller initially has no (or only limited) information about the customers' preferences and must learn them through repeated interactions with the arriving customers. Specifically, in each period, the seller offers an assortment to the customer; the customer makes a choice from that assortment according to an unknown choice model; and the seller observes only the resulting choice, based on which it must update its estimates and future decisions. This bandit feedback gives rise to the classical trade-off between exploration and exploitation: the seller must simultaneously gain information about the customers' preferences and offer revenue-maximizing assortments, all while respecting the resource constraints.
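The interaction protocol described in the abstract can be made concrete with a short simulation. Below is a minimal sketch that assumes, for illustration only, a multinomial logit (MNL) customer choice model and unit inventory consumption per purchase; the abstract does not fix a specific choice model or learning algorithm, so the attraction weights `v`, revenues `r`, and the placeholder offer-everything policy are hypothetical stand-ins, with the learning step left as a stub.

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem primitives (illustrative values; N and T are as in the abstract,
# while the MNL weights v and revenues r are hypothetical).
N, T = 5, 1000                     # products, selling periods
r = rng.uniform(1.0, 10.0, N)      # per-unit revenues
v = rng.uniform(0.1, 1.0, N)       # true MNL attraction weights (unknown to seller)
inventory = np.full(N, 50)         # fixed initial inventory per product

def customer_choice(assortment):
    """Sample one choice under an MNL model: P(i) = v_i / (1 + sum_j v_j),
    with the remaining probability mass on the no-purchase option (-1)."""
    weights = v[assortment]
    denom = 1.0 + weights.sum()
    probs = np.append(weights / denom, 1.0 / denom)  # last entry: no purchase
    idx = rng.choice(len(assortment) + 1, p=probs)
    return assortment[idx] if idx < len(assortment) else -1

revenue = 0.0
purchases = np.zeros(N)  # purchase counts: the seller's only (bandit) feedback
offers = np.zeros(N)     # how often each product was offered

for t in range(T):
    available = np.flatnonzero(inventory > 0)
    if available.size == 0:
        break  # all resources exhausted
    # Placeholder policy: offer every in-stock product. A learning algorithm
    # (e.g., optimistic or posterior-sampling estimates of v built from
    # `purchases` and `offers`) would choose the assortment here instead.
    assortment = available
    offers[assortment] += 1
    choice = customer_choice(assortment)
    if choice >= 0:                # a purchase occurred
        inventory[choice] -= 1
        revenue += r[choice]
        purchases[choice] += 1

print(f"total revenue over {t + 1} periods: {revenue:.1f}")
```

The point of the sketch is the feedback structure: the seller observes only which product (if any) was purchased from the offered assortment, never the customer's full preferences, which is what makes balancing estimation against immediate revenue, under limited inventory, nontrivial.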