Keywords: reinforcement learning, offline RL, offline reinforcement learning, residual algorithms, residual gradient
Abstract: The residual gradient algorithm (RG), gradient descent on the Mean Squared Bellman Error, brings robust convergence guarantees to bootstrapped value estimation. Meanwhile, the far more common semi-gradient algorithm (SG) suffers from well-known instabilities and divergence. Unfortunately, RG often converges slowly in practice. Baird (1995) proposed residual algorithms (RA), a weighted average of RG and SG, to combine RG’s robust convergence with SG’s speed. RA works moderately well in the online setting. We find, however, that RA works disproportionately well in the offline setting. Concretely, we find that merely adding a variable residual component to SAC gives state-of-the-art scores for about half of the D4RL gym tasks. We further show that using the minimum of ten critics lets our algorithm approximately match SAC-$N$'s state-of-the-art returns using 50$\times$ less compute. In contrast, TD3+BC with the same minimum-of-ten-critics trick does not match SAC-$N$'s returns on many environments. The only hyperparameter we tune is our residual weight; we leave all other hyperparameters unchanged from SAC-$N$.
TL;DR: We roughly match SAC-$N$'s SOTA scores on D4RL gym tasks using 50$\times$ less compute, simply by developing and tuning a particular residual weight while leaving all other hyperparameters untouched.
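
For readers unfamiliar with the weighted averaging of RG and SG mentioned in the abstract, the following is a minimal sketch for a linear value function $V(s) = w^\top x(s)$ and a single transition. It is not the SAC-based method of this work, only an illustration of the basic idea attributed to Baird (1995). The names `residual_step` and `eta` (the residual weight), and the toy feature vectors, are assumptions made for this example: `eta = 0` recovers the semi-gradient update and `eta = 1` recovers the residual gradient update.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def residual_step(w, x, x_next, reward, gamma, eta, lr):
    """One residual-algorithm update for a linear value function V(s) = w . x(s).

    eta is a hypothetical name for the residual weight:
      eta = 0 -> semi-gradient (SG) update
      eta = 1 -> residual gradient (RG) update
    """
    # Bellman (TD) error for this transition.
    delta = reward + gamma * dot(w, x_next) - dot(w, x)
    # SG moves along x only; RG also follows the gradient through the
    # bootstrapped target, -gamma * x_next. eta interpolates between the two.
    return [
        wi + lr * delta * (xi - eta * gamma * xni)
        for wi, xi, xni in zip(w, x, x_next)
    ]


# Toy transition with made-up features, purely for illustration.
w0 = [0.0, 0.0]
x, x_next = [1.0, 0.0], [0.0, 1.0]

sg = residual_step(w0, x, x_next, reward=1.0, gamma=0.5, eta=0.0, lr=0.1)
rg = residual_step(w0, x, x_next, reward=1.0, gamma=0.5, eta=1.0, lr=0.1)
ra = residual_step(w0, x, x_next, reward=1.0, gamma=0.5, eta=0.5, lr=0.1)

# The eta = 0.5 update is the equal-weight average of the SG and RG updates.
average = [0.5 * s + 0.5 * r for s, r in zip(sg, rg)]
```

In this sketch the SG update leaves the weight on the next-state feature untouched, while the RG update also pushes the next-state value toward consistency with the current one; intermediate values of `eta` trade off between the two behaviours.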