Keywords: Non-stationary Multi-armed Bandits, Predictive Modeling, Statistical Estimation, Regret Bounds
TL;DR: UEP unifies statistical estimation with predictive modeling through adaptive weighting to achieve superior performance in non-stationary multi-armed bandits where reward distributions change over time.
Abstract: Non-stationary multi-armed bandits pose a fundamental challenge in sequential decision-making because reward distributions evolve over time. Existing statistical-estimation-based work often overlooks the learnable temporal patterns inherent in many real-world applications, which encode valuable information for predicting future trends. To leverage such patterns, we propose a unified framework, UEP, that captures these dynamic patterns by combining statistical estimation (an estimator) with predictive modeling (a predictor). Based on estimation errors, UEP automatically determines the optimal window size and balances the predictor and estimator through an adaptively calculated mixing weight, without requiring prior knowledge of the environment. We prove a regret bound of $O(K^{(3d+2)/(2d+1)} T^{(d+1)/(2d+1)} (\log(KT))^{1/2})$, which improves upon the existing $O(K^{1/3} T^{1-d/3})$ result when the environment changes quickly ($d < 1$), under mild assumptions. Through a series of experiments, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to complex, rapidly evolving time series.
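The abstract's core idea, blending a windowed statistical estimate with a predictor's forecast via an error-driven weight, can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function `adaptive_mix`, the one-step-back error proxy, and the inverse-error weighting rule are all assumptions made for exposition.

```python
def adaptive_mix(rewards, window, predictor):
    """Blend a sliding-window mean (estimator) with a predictor's forecast,
    weighting each inversely to its recent squared error.

    Hypothetical sketch of the estimator/predictor combination described
    in the abstract; the real UEP weighting rule may differ.
    """
    est = sum(rewards[-window:]) / min(window, len(rewards))
    pred = predictor(rewards)

    # Use each component's error on the most recent observation as a
    # simple proxy for the adaptively calculated mixing weight.
    prev = rewards[:-1] or rewards
    est_err = (sum(prev[-window:]) / min(window, len(prev)) - rewards[-1]) ** 2
    pred_err = (predictor(prev) - rewards[-1]) ** 2

    # Weight on the estimator grows when the predictor errs more, and
    # vice versa; small constants avoid division by zero.
    w = (pred_err + 1e-9) / (est_err + pred_err + 2e-9)
    return w * est + (1 - w) * pred
```

On a steadily trending reward sequence, a linear-extrapolation predictor has low recent error, so the mix leans toward the predictor; on noisy stationary data, the windowed mean dominates instead.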
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 5031