A Differentiable Sequence Model Perspective on Policy Gradients

20 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: reinforcement learning, policy gradient, sequence models, gradients, transformers, credit assignment, world models, backpropagation, decision-aware
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose and analyze a model-based policy gradient framework where powerful sequence models demonstrably improve long-term credit assignment and sample efficiency in reinforcement learning.
Abstract: Progress in sequence modeling with deep learning has been driven by advances in temporal credit assignment stemming from better gradient propagation in neural network architectures. In this paper, we show that using deep dynamics models conditioned on sequences of actions allows us to draw a direct connection between gradient propagation in neural networks and policy gradients, and to harness those advances for sequential decision-making. We leverage this connection to analyze, understand, and improve policy gradient methods with tools developed for deep sequence models, showing theoretically that modern architectures provably give better policy gradients. Furthermore, we demonstrate empirically that, in our algorithmic framework, better sequence models entail better policy optimization: when the environment dynamics are well-behaved, better neural network architectures yield more accurate policy gradients; when they are chaotic or non-differentiable, neural networks can provide gradients better suited for policy optimization than those of the real differentiable simulator. On an optimal control testbed, we show that, within our framework, agents enjoy stronger long-term credit assignment and greater sample efficiency than traditional model-based and model-free approaches.
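To make the connection concrete, the core idea of backpropagating policy gradients through a learned, differentiable dynamics model can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the linear dynamics, linear policy, quadratic reward, and all names (`dynamics`, `policy`, `rollout_return`) are assumptions made here for clarity.

```python
import jax
import jax.numpy as jnp

def dynamics(dyn_params, s, a):
    # Learned differentiable dynamics model (here a toy linear one):
    # s' = W_s s + W_a a. In the paper's setting this would be a deep
    # sequence model conditioned on the action sequence.
    W_s, W_a = dyn_params
    return W_s @ s + W_a @ a

def policy(pol_params, s):
    # Deterministic linear policy: a = K s (illustrative choice).
    return pol_params @ s

def rollout_return(pol_params, dyn_params, s0, horizon=10):
    # Unroll the learned model with the policy and sum rewards.
    # Toy reward: -||s||^2, i.e., drive the state toward the origin.
    def step(s, _):
        a = policy(pol_params, s)
        s_next = dynamics(dyn_params, s, a)
        return s_next, -jnp.sum(s_next ** 2)
    _, rewards = jax.lax.scan(step, s0, None, length=horizon)
    return jnp.sum(rewards)

# The model-based policy gradient is ordinary backpropagation through
# the unrolled model, obtained here with jax.grad:
policy_grad = jax.grad(rollout_return, argnums=0)

s0 = jnp.ones(2)
dyn_params = (0.9 * jnp.eye(2), 0.1 * jnp.eye(2))
pol_params = jnp.zeros((2, 2))
g = policy_grad(pol_params, dyn_params, s0)
print(g.shape)  # (2, 2)
```

The key point of this sketch is that once the environment is replaced by a differentiable learned model, the quality of the policy gradient is governed by how well gradients propagate through the unrolled architecture, which is exactly the credit-assignment question studied for deep sequence models.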
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2774