Offline Contextual Bandits with Covariate Shift

Published: 28 Nov 2025, Last Modified: 30 Nov 2025 · NeurIPS 2025 Workshop MLxOR · CC BY 4.0
Keywords: bandit learning; covariate shift; distributionally robust learning
Abstract: Offline policy learning aims to optimize decision-making policies using historical data and plays a central role in many real-world applications, such as personalized advertising, medical treatment recommendation, and pricing decisions. A major challenge in this setting is the potential mismatch between the training environment---where the data were collected---and the test environment---where the learned policy is evaluated. This challenge has motivated extensive research on distributionally robust methods, which aim to maintain performance under worst-case distribution shifts. However, such approaches can be overly conservative when the environment changes in more structured ways. In this paper, we focus on the offline learning setting where the only difference between the training and test environments lies in the distribution of the context variables. Adopting the concept of transfer exponents from the transfer learning literature to model such covariate shift, we establish minimax-optimal sample complexity bounds for offline decision-making with general nonparametric reward functions. We further show that a pessimism-based algorithm attains these optimal rates.
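To make the pessimism principle mentioned in the abstract concrete, below is a minimal, illustrative sketch of lower-confidence-bound policy selection from logged contextual bandit data using inverse propensity scoring. It is not the paper's algorithm or analysis; the function names, the Hoeffding-style penalty, and the assumption of rewards in [0, 1] are all hypothetical choices made for this example.

```python
import numpy as np

def ips_value_estimate(contexts, actions, rewards, propensities, policy):
    """Inverse-propensity-scored estimate of a candidate policy's value
    from logged data. policy(x) returns the action the candidate policy
    would take in context x."""
    matches = np.array([policy(x) == a for x, a in zip(contexts, actions)])
    weights = matches / propensities  # importance weights of logged samples
    return np.mean(weights * rewards)

def pessimistic_policy_selection(contexts, actions, rewards, propensities,
                                 candidate_policies, delta=0.05):
    """Pick the candidate policy with the largest lower confidence bound (LCB)
    on its estimated value: a generic pessimism rule, not the paper's exact
    procedure. Assumes rewards lie in [0, 1]."""
    n = len(rewards)
    # Crude Hoeffding-style penalty using the worst-case importance weight;
    # a sharper, problem-dependent bonus would be used in a real analysis.
    penalty = (1.0 / np.min(propensities)) * np.sqrt(np.log(1.0 / delta) / (2 * n))
    best_policy, best_lcb = None, -np.inf
    for pi in candidate_policies:
        v_hat = ips_value_estimate(contexts, actions, rewards, propensities, pi)
        lcb = v_hat - penalty
        if lcb > best_lcb:
            best_policy, best_lcb = pi, lcb
    return best_policy, best_lcb
```

The pessimistic penalty discourages selecting policies whose value estimates rely on poorly covered regions of the logged data, which is the intuition behind pessimism-based offline learning under distribution mismatch.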
Submission Number: 186