Perishable Online Inventory Control with Context-Aware Demand Distributions

ICLR 2026 Conference Submission9897 Authors

17 Sept 2025 (modified: 08 Oct 2025). ICLR 2026 Conference Submission. Readers: Everyone. License: CC BY 4.0

Keywords: online learning, inventory control, kernel regression, contextual bandits, neural bandits

TL;DR: We study online contextual inventory control of perishable goods in which the demand distribution may depend on the context in a nonparametric way; we give a regret lower bound and a near-optimal algorithm.

Abstract: We study the online contextual inventory control problem with perishable goods. We propose and analyze a more realistic, and more challenging, setting in which both the expected demand and the (residual) noise distribution depend on the observable features. Surprisingly, little is known when the noise is context-dependent, a case that captures the heteroskedastic uncertainty in demand that is important in inventory control. The optimal inventory quantity in this general setting is no longer a linear function of the features (unlike the case in which the expected demand is linear and the noise is i.i.d.), making online gradient descent, the gold standard in that case, inapplicable. We first propose an algorithm that achieves the near-optimal regret $\widetilde{O}(\sqrt{d T}+T^{\frac{p+1}{p+2}})$ under linear expected demand and context-aware noise. Here $d$ is the feature dimension, and $p \leq d$ is an underlying dimension that captures the intrinsic complexity of the noise distribution. When the expected demand is nonlinear, we propose to use neural networks to capture the nonlinearity, and prove a regret bound $\widetilde{O}(\sqrt{\alpha T}+T^{\frac{p+1}{p+2}})$ for over-parameterized networks, where $\alpha$ depends on the complexity of the nonlinear demand and on the network architecture. Additionally, under mild regularity conditions on the noise, the $T^{\frac{p+1}{p+2}}$ term in these regret bounds improves to $p\sqrt{T}$. Finally, we provide a matching minimax lower bound $\Omega(\sqrt{d T}+T^{\frac{p+1}{p+2}})$ under linear expected demand. To the best of our knowledge, our results provide the first minimax-optimal characterization of online inventory control with context-dependent noise and the first theoretical guarantees when the expected demand is nonlinear in the features.
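
Illustration (not from the paper): the claim that the optimal inventory quantity stops being linear in the features can be checked on a toy example. The sketch below assumes the standard newsvendor loss, with a per-unit lost-sales cost b and a per-unit holding cost h, under which the optimal order is the b/(b+h) quantile of demand given the context. The abstract does not state the cost model, so this loss, the Gaussian noise, the two noise-scale functions, and all function names are assumptions made only for illustration.

```python
from statistics import NormalDist


def critical_ratio(b: float, h: float) -> float:
    """Newsvendor critical ratio b / (b + h) (assumed cost model)."""
    return b / (b + h)


def optimal_order(x, theta, sigma_fn, b=3.0, h=1.0):
    """Conditional demand quantile at the critical ratio.

    Assumes demand = <theta, x> + sigma_fn(x) * eps with eps ~ N(0, 1);
    sigma_fn is a hypothetical context-dependent noise scale.
    """
    mean = sum(t * xi for t, xi in zip(theta, x))
    z = NormalDist().inv_cdf(critical_ratio(b, h))
    return mean + sigma_fn(x) * z


def midpoint_gap(sigma_fn, theta=(2.0,), x1=(0.0,), x2=(2.0,)):
    """q(midpoint) minus the average of q(x1) and q(x2).

    The gap is zero whenever the optimal order is affine in x.
    """
    mid = tuple((a + c) / 2 for a, c in zip(x1, x2))
    q = lambda x: optimal_order(x, theta, sigma_fn)
    return q(mid) - (q(x1) + q(x2)) / 2


def homoskedastic(x):
    return 1.0  # i.i.d. noise: the scale ignores the context


def heteroskedastic(x):
    return 1.0 + x[0] ** 2  # noise scale grows with the feature


gap_iid = midpoint_gap(homoskedastic)
gap_ctx = midpoint_gap(heteroskedastic)
```

With i.i.d. noise the quantile shift is a constant that an intercept absorbs, so `gap_iid` vanishes up to rounding and the optimal order stays affine in the features. With the context-dependent scale, `gap_ctx` is clearly nonzero, so no linear rule in the features can match the optimal order on this toy instance.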

Primary Area: learning theory

Submission Number: 9897
