Keywords: Diffusion models, offline reinforcement learning, deep learning
TL;DR: We develop a TD update for the state distribution and use this to build a practical offline RL algorithm.
Abstract: The state occupancy measure (SOM) and successor state measure (SSM) are important theoretical tools in reinforcement learning that represent the distribution of future states. However, while these tools see extensive use in theory and theoretically motivated algorithms, they have seen little use in practical settings because existing algorithms for learning the SOM and SSM are high-variance or unstable. To address this, we explore using diffusion models as a representation for the successor state measure. We find that enforcing the Bellman flow constraints on a diffusion model leads to a temporal difference update on the predicted noise, analogous to the standard TD-learning update on the predicted reward. As a result, our method has the expressive power of a diffusion model and a variance comparable to that of TD-learning. To demonstrate the method's practicality, we propose a simple reinforcement learning algorithm based on regularizing the learned SSM. We test the proposed method on an array of offline RL problems and find that it achieves the highest average performance of all methods in the literature, as well as state-of-the-art performance on several individual environments.
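The abstract's central idea, a TD-style target on the predicted noise of a diffusion model, can be sketched roughly as follows. This is a hypothetical illustration based only on the description above, not the authors' implementation: the names (`denoiser`, `target_denoiser`, `alpha_bar`, `future_state`) and the (1 − γ)/γ mixture of the one-step and bootstrapped noise targets are all assumptions.

```python
# Hypothetical sketch of a TD update on predicted noise for a diffusion-model
# successor state measure, mixing a one-step target with a bootstrapped target
# from a target network, analogous to r + gamma * Q(s', a') in TD-learning.
import torch

def td_noise_loss(denoiser, target_denoiser, s, a, s_next, a_next,
                  future_state, t, alpha_bar, gamma):
    """All tensor arguments are batched. `alpha_bar` holds the cumulative
    noise-schedule values at diffusion step `t`; `future_state` is a sample
    the successor state measure should cover; `gamma` is the discount."""
    # Forward-diffuse the future-state sample.
    eps = torch.randn_like(future_state)
    x_t = alpha_bar.sqrt() * future_state + (1 - alpha_bar).sqrt() * eps

    with torch.no_grad():
        # One-step target: the noise that maps x_t back to the observed next state.
        eps_next = (x_t - alpha_bar.sqrt() * s_next) / (1 - alpha_bar).sqrt()
        # Bootstrapped target: noise predicted by the target network at (s', a').
        eps_boot = target_denoiser(x_t, t, s_next, a_next)
        # Bellman-flow-style mixture of the two targets (assumed form).
        target = (1 - gamma) * eps_next + gamma * eps_boot

    # Regress the conditional denoiser at (s, a) toward the TD target.
    pred = denoiser(x_t, t, s, a)
    return ((pred - target) ** 2).mean()
```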
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Liam_Schramm2
Track: Regular Track: unpublished work
Submission Number: 142