Sliding window delayed gradient for ensemble-based offline reinforcement learning
Abstract: Offline reinforcement learning aims to learn optimal policies from static datasets, where the key challenge is accurately estimating the values of out-of-distribution actions. Ensemble-based methods address this issue by aggregating multiple networks to reduce uncertainty in Q-value estimates. However, previous ensemble-based methods suffer from performance degradation due to the inevitably high correlation among Q-functions, which stems from identical architectures, shared inputs, and synchronized Bellman targets. In this paper, we propose Sliding Window Delayed Gradient (SWDG), a novel ensemble-based offline RL algorithm that leverages the temporal asynchrony introduced by a sliding window mechanism to dynamically maintain diversity among Q-functions. To further reduce correlation and extrapolation error, SWDG constructs the learning targets for both the actor and the critic from a subset of delayed-update networks, temporally decoupling target construction and reducing error accumulation, and additionally applies an intra-window gradient-similarity regularizer. We theoretically show that the sliding window mechanism tightens the pessimistic lower bound and enhances temporal decorrelation among Q-functions, and that the delayed gradient targets further strengthen this guarantee. Our experiments on the D4RL benchmark demonstrate that SWDG achieves state-of-the-art performance.
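To make the mechanism concrete, below is a minimal PyTorch sketch of the critic side of a sliding-window delayed-target update, written from the abstract alone; it is not the authors' implementation. All names (`QNetwork`, `critic_update`, `lambda_div`), the min-over-snapshots target, and the mean-gradient decorrelation standing in for the intra-window gradient-similarity regularizer are illustrative assumptions, and the actor update is omitted.

```python
import copy
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """A single Q-function mapping (state, action) to a scalar value."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def critic_update(q_nets, optimizers, window, batch, policy,
                  gamma=0.99, lambda_div=0.1):
    """One critic step: Bellman targets come from delayed snapshots held in
    `window`, and per-member gradients are decorrelated before stepping."""
    assert len(window) > 0, "seed the window with an initial snapshot"
    s, a, r, s2, done = batch

    # 1) Pessimistic delayed target: minimum Q-value over every network in
    #    every past snapshot (one common ensemble aggregation; the paper's
    #    exact choice may differ).
    with torch.no_grad():
        a2 = policy(s2)
        q_next = torch.stack(
            [q(s2, a2) for snapshot in window for q in snapshot]
        ).min(dim=0).values
        target = r + gamma * (1.0 - done) * q_next

    # 2) TD loss per ensemble member, keeping a flattened copy of each
    #    member's gradient for the similarity penalty below.
    grads = []
    for q, opt in zip(q_nets, optimizers):
        opt.zero_grad()
        F.mse_loss(q(s, a), target).backward()
        grads.append(torch.cat([p.grad.flatten() for p in q.parameters()]))

    # 3) Gradient-similarity regularization (a simplified stand-in acting
    #    across ensemble members at the current step): shrink each member's
    #    component along the shared mean-gradient direction so that the
    #    members' updates decorrelate.
    mean_grad = torch.stack(grads).mean(dim=0)
    unit = mean_grad / (mean_grad.norm() + 1e-8)
    for q, opt, g in zip(q_nets, optimizers, grads):
        g_new = g - lambda_div * (g @ unit) * unit
        with torch.no_grad():
            i = 0
            for p in q.parameters():
                n = p.numel()
                p.grad.copy_(g_new[i:i + n].view_as(p))
                i += n
        opt.step()

    # 4) Slide the window: append a frozen snapshot of the just-updated
    #    ensemble; deque(maxlen=...) drops the oldest snapshot automatically.
    window.append([copy.deepcopy(q).requires_grad_(False) for q in q_nets])
```

The window must be seeded once before the first call, e.g. `window = deque(maxlen=5)` followed by appending a frozen copy of the initial ensemble; the window length and `lambda_div` are tunable hyperparameters not specified in the abstract.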