Keywords: Offline RL, Q-ensemble, Sliding Window, Uncertainty Estimation
TL;DR: SWDG is an ensemble-based offline RL method that uses a sliding window with delayed-gradient targets to decorrelate Q-networks, tighten the pessimistic lower bound, curb out-of-distribution error, and achieve state-of-the-art results on D4RL.
Abstract: Offline reinforcement learning aims to learn optimal policies from static datasets, where a key challenge is accurately estimating values for out-of-distribution (OOD) actions. Ensemble-based methods address this issue by aggregating multiple Q-networks to reduce the uncertainty in Q-value estimates. However, prior work suffers from inevitably high correlation among Q-functions, driven by identical architectures, shared inputs, and synchronized Bellman targets. Such correlation reduces the robustness of Q-ensembles, ultimately degrading policy performance. In this paper, we propose sliding window delayed gradient (SWDG), a novel ensemble-based offline RL algorithm that leverages the temporal asynchrony induced by a sliding-window mechanism to dynamically maintain diversity among Q-functions. Meanwhile, to further reduce extrapolation error and correlation, SWDG uses the Q-networks inside the sliding window as delayed-gradient targets to compute the temporal-difference (TD) error. We theoretically show that the sliding-window mechanism tightens the pessimistic lower bound and enhances temporal decorrelation among Q-functions, while the delayed-gradient targets further strengthen this guarantee. Our experiments on the D4RL benchmark show that SWDG achieves state-of-the-art performance.
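To make the target construction concrete, below is a minimal PyTorch sketch of how a sliding-window delayed-gradient TD target over a Q-ensemble could be structured, based only on the abstract. All names (`QNet`, `SlidingWindowEnsemble`, `window`) and the min-based pessimistic aggregation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a sliding-window delayed-gradient TD target
# for a Q-ensemble; not the paper's actual code.
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class SlidingWindowEnsemble:
    def __init__(self, obs_dim, act_dim, n_nets=10, window=4):
        self.nets = [QNet(obs_dim, act_dim) for _ in range(n_nets)]
        self.window = window
        self.step = 0  # advances the window each update, desynchronizing targets

    def td_target(self, reward, next_obs, next_act, gamma=0.99):
        # Delayed-gradient target: Q-networks inside the current window are
        # evaluated under no_grad, so the target lags the online networks.
        start = self.step % len(self.nets)
        idx = [(start + i) % len(self.nets) for i in range(self.window)]
        with torch.no_grad():
            q_next = torch.stack([self.nets[i](next_obs, next_act) for i in idx])
            # Pessimistic aggregation (assumed): minimum over the window's
            # Q-estimates yields a conservative lower bound on the target.
            target = reward + gamma * q_next.min(dim=0).values
        self.step += 1
        return target
```

In this sketch, advancing `step` shifts the window on every update, so different subsets of networks serve as targets at different times. This is one plausible way to realize the temporal asynchrony the abstract describes.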
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 4847