Keywords: Non-stationary Multi-armed Bandits, Predictive Modeling, Statistical Estimation, Regret Bounds
TL;DR: UEP unifies statistical estimation with predictive modeling through adaptive weighting to achieve superior performance in non-stationary multi-armed bandits where reward distributions change over time.
Abstract: Non-stationary multi-armed bandits pose a fundamental challenge in sequential decision-making because reward distributions evolve over time. Existing statistical-estimation-based work often overlooks the learnable temporal patterns inherent in many real-world applications, which encode valuable information for predicting future trends. To leverage such patterns, we propose a unified framework, UEP, that captures these dynamic patterns by combining statistical estimation (an estimator) with predictive modeling (a predictor). Based on estimation errors, UEP automatically determines the optimal window size and balances the predictor and estimator through an adaptively calculated mixing weight, without requiring prior knowledge of the environment. We prove a regret bound of $O(K^{(3d+2)/(2d+1)} T^{(d+1)/(2d+1)} (\log(KT))^{1/2})$, which improves upon the existing $O(K^{1/3} T^{1-d/3})$ result when the environment changes quickly ($d < 1$), under mild assumptions. Through a series of experiments, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to complex, rapidly evolving time series.
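The abstract's core idea, blending a windowed statistical estimate with a predictor's forecast via an error-driven weight, can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function `adaptive_mix`, the one-step-back error proxy, and the inverse-error weighting rule are all assumptions made for exposition.

```python
def adaptive_mix(rewards, window, predictor):
    """Blend a sliding-window mean (estimator) with a predictor's forecast,
    weighting each inversely to its recent squared error.

    Hypothetical sketch of the estimator/predictor combination described
    in the abstract; the real UEP weighting rule may differ.
    """
    est = sum(rewards[-window:]) / min(window, len(rewards))
    pred = predictor(rewards)

    # Use each component's error on the most recent observation as a
    # simple proxy for the adaptively calculated mixing weight.
    prev = rewards[:-1] or rewards
    est_err = (sum(prev[-window:]) / min(window, len(prev)) - rewards[-1]) ** 2
    pred_err = (predictor(prev) - rewards[-1]) ** 2

    # Weight on the estimator grows when the predictor errs more, and
    # vice versa; small constants avoid division by zero.
    w = (pred_err + 1e-9) / (est_err + pred_err + 2e-9)
    return w * est + (1 - w) * pred
```

On a steadily trending reward sequence, a linear-extrapolation predictor has low recent error, so the mix leans toward the predictor; on noisy stationary data, the windowed mean dominates instead.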
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 5031