Learning in Continuous State-Space MDPs for Network Inventory Management
TL;DR: This paper presents an online learning algorithm, with a provably near-optimal regret rate, for managing multi-location inventory networks under unknown demand and censored sales data.
Abstract: We consider online learning in infinite-horizon, average-cost Markov Decision Processes (MDPs) with multi-dimensional, continuous state spaces and censored feedback. Our model setting, motivated by network inventory management applications such as vehicle sharing, is characterized by complex, correlated state transitions and the absence of value function convexity, rendering standard analytical techniques for both MDPs and inventory control inapplicable. Our primary contribution is an integrated framework establishing and leveraging the Lipschitz property of the long-run average cost function. This insight allows us to analyze the problem through the lens of Lipschitz bandits, for which we design a provably efficient online learning algorithm that learns a near-optimal policy from censored demand data. We derive a high-probability regret bound of $O(T^{\frac{n}{n+1}} (\log T)^{\frac{1}{n+1}})$, where $n$ is the network size, obtained through new concentration inequalities for cumulative costs in MDPs with state-dependent transitions. Furthermore, we establish a matching lower bound for this learning problem, which captures the inherent dimensionality challenge.
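For intuition on how the stated rate scales with the network size, the sketch below evaluates $T^{\frac{n}{n+1}} (\log T)^{\frac{1}{n+1}}$ with constants suppressed (the function name `regret_bound` is illustrative, not from the paper):

```python
import math

def regret_bound(T: int, n: int) -> float:
    """Evaluate the T^{n/(n+1)} (log T)^{1/(n+1)} rate from the abstract.
    Constants are suppressed; this is for scaling intuition only."""
    return T ** (n / (n + 1)) * math.log(T) ** (1 / (n + 1))

# The exponent n/(n+1) approaches 1 as the network size n grows,
# so the achievable rate degrades toward linear regret in T for
# larger networks -- the dimensionality challenge the lower bound captures.
for n in (1, 2, 5, 10):
    print(f"n={n}: exponent={n / (n + 1):.3f}, bound(T=1e6)={regret_bound(10**6, n):.1f}")
```

The matching lower bound in the paper implies this degradation with $n$ is unavoidable, not an artifact of the algorithm.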
Submission Number: 297