Abstract: In online retailing, the seller aims to offer an assortment of items that maximizes
revenue. We introduce a new online learning problem, dynamic assortment selection
with positioning (DAP), in which the seller additionally learns the optimal positioning of items within the assortment.
Specifically, customers make purchases according to a multinomial logit choice model in
which each item's attractiveness is the product of a position effect and an unknown
preference parameter. We first demonstrate that any assortment-only algorithm that neglects
position effects incurs linear regret. To address this gap, we propose the truncated linear
regression upper confidence bound (TLR-UCB) policy. TLR-UCB utilizes a novel geometric
linear bandit–type feedback structure for UCB construction under random and
adaptive position effects. In addition, TLR-UCB conducts well-designed truncations before
applying linear regression to handle conditional geometric responses. In theory, we establish
a regret upper bound of $\tilde{O}(T^{1/2})$ for TLR-UCB, matching our derived $\Omega(T^{1/2})$ lower
bound. Moreover, we develop an explore-in-TLR-UCB (EI-TLR) policy to tackle unknown
position effects. It first conducts a joint learning procedure to estimate unknown preferences
and position effects, and then implements a generalized TLR-UCB procedure driven
by estimated position effects. Extensive experiments demonstrate the superior performance
of TLR-UCB and EI-TLR over other benchmark policies.
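
To make the choice model concrete, the following is a minimal sketch of multinomial logit purchase probabilities in which each displayed item's attractiveness is its position effect times its preference parameter. It is an illustration only, not the paper's implementation: the function names, the numeric values, and the normalization of the no-purchase option's weight to 1 are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def mnl_choice_probabilities(
    preferences: Sequence[float],
    position_effects: Sequence[float],
) -> Tuple[List[float], float]:
    """Return (purchase probabilities per slot, no-purchase probability).

    preferences[k] is the preference parameter of the item shown in slot k,
    and position_effects[k] is the effect of slot k (hypothetical names).
    """
    if len(preferences) != len(position_effects):
        raise ValueError("one position effect is needed per displayed item")
    # Attractiveness = position effect * preference parameter.
    attractiveness = [lam * v for lam, v in zip(position_effects, preferences)]
    # Assumption: the no-purchase option has weight 1.
    denominator = 1.0 + sum(attractiveness)
    return [a / denominator for a in attractiveness], 1.0 / denominator


def expected_revenue(
    revenues: Sequence[float],
    preferences: Sequence[float],
    position_effects: Sequence[float],
) -> float:
    """Expected revenue of one customer visit for a given positioned assortment."""
    probabilities, _ = mnl_choice_probabilities(preferences, position_effects)
    return sum(r * p for r, p in zip(revenues, probabilities))


# Two equally preferred items; the first slot is twice as prominent.
position_effects = [1.0, 0.5]
high_margin_first = expected_revenue([10.0, 1.0], [1.0, 1.0], position_effects)
high_margin_second = expected_revenue([1.0, 10.0], [1.0, 1.0], position_effects)
```

In this toy setting the same two items yield different expected revenue depending on which one occupies the more prominent slot, which is why a policy that ignores positioning can keep choosing a suboptimal arrangement.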