Abstract: In a contextual pricing problem, a seller aims to maximize revenue over a sequence of sales sessions (each described by a feature vector) using binary-censored feedback of "sold" or "not sold". Existing methods often overlook two practical challenges: (1) the best pricing strategy may change over time; (2) the prices and pricing policies must conform to hard constraints due to safety, ethical or legal restrictions. We address both challenges by solving a more general problem of "universal dynamic regret" minimization in proper online learning with exp-concave losses. This is an open problem posed by Baby & Wang (2021), which we partially resolve in this paper by restricting attention to loss functions that come from a generalized linear model. Here "dynamic regret" measures the performance relative to a non-stationary sequence of policies, and "proper" means that the learner must choose feasible strategies within a pre-defined convex set, which we use to model the safety constraints. In this work, we consider a linear noisy valuation model for the customers. When the market noise is known and strictly log-concave, our algorithm achieves $\tilde{O}(d^3T^{1/3}C_T^{2/3} \vee d^3)$ dynamic regret relative to the optimal sequence of policies, where $T$, $d$ and $C_T$ denote the time horizon, the feature dimension and the total variation (characterizing non-stationarity), respectively. This regret is near-optimal with respect to $T$ (within $O(\log T)$ gaps) and $C_T$, and our algorithm adapts to unknown $C_T$ and remains feasible throughout. However, the dependence on $d$ is suboptimal, and the minimax rate is still open.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Andras_Gyorgy1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 216