Differentiable Trajectory Optimization as a Policy Class for Reinforcement and Imitation Learning

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: differentiable trajectory optimization, model-based reinforcement learning, imitation learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper introduces DiffTOP, a new policy class for reinforcement learning and imitation learning that utilizes differentiable trajectory optimization to generate the policy actions.
Abstract: This paper introduces DiffTOP, a new policy class for reinforcement learning and imitation learning that utilizes differentiable trajectory optimization to generate the policy actions. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization. As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end, e.g., using the policy gradient loss in reinforcement learning, or the imitation loss in imitation learning. When applied to model-based reinforcement learning, DiffTOP addresses the “objective mismatch” issue of prior algorithms, as the dynamics model in DiffTOP is learned to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. When applied to imitation learning, DiffTOP performs test-time trajectory optimization to compute the actions with a learned cost function, outperforming prior methods that only perform forward passes of the policy network to generate actions. We benchmark DiffTOP on 15 model-based RL tasks and 13 imitation learning tasks with high-dimensional image and point-cloud inputs, and show that it outperforms prior state-of-the-art methods in both domains.
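To make the mechanism in the abstract concrete, below is a minimal PyTorch-style sketch of a policy whose actions come from differentiable trajectory optimization: an inner loop runs unrolled gradient descent on the total learned cost of an action sequence under a learned dynamics model, and an outer imitation loss is backpropagated through that inner loop to train the cost and dynamics end-to-end. This is an illustrative sketch only; the network shapes, hyperparameters, and the unrolled-gradient-descent inner solver are assumptions for exposition, not the paper's actual architecture or solver.

```python
# Illustrative sketch: actions are produced by trajectory optimization over a
# learned cost and dynamics, and the optimization is unrolled so an outer loss
# can be differentiated through it. All names/shapes below are assumptions.
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 8, 2, 5

# Learned dynamics f(s, a) -> s' and learned per-step cost c(s, a) -> scalar.
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                         nn.Linear(64, state_dim))
cost = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                     nn.Linear(64, 1))

def plan(s0, n_inner=10, lr=0.1):
    """Differentiable trajectory optimization: unrolled gradient descent on the
    total learned cost of an action sequence. create_graph=True keeps the inner
    steps in the graph, so gradients flow back to `cost` and `dynamics`."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    for _ in range(n_inner):
        s, total = s0, 0.0
        for t in range(horizon):
            sa = torch.cat([s, actions[t]])
            total = total + cost(sa).squeeze()  # accumulate learned cost
            s = dynamics(sa)                    # roll learned dynamics forward
        (g,) = torch.autograd.grad(total, actions, create_graph=True)
        actions = actions - lr * g              # differentiable inner update
    return actions

# Outer loop (imitation learning): the imitation loss is differentiated
# through the inner optimization, so the cost/dynamics are trained to make
# the *optimized* actions match the expert's. Demo data is a placeholder.
opt = torch.optim.Adam(list(cost.parameters()) + list(dynamics.parameters()),
                       lr=1e-3)
s0 = torch.randn(state_dim)
expert_actions = torch.randn(horizon, action_dim)
for step in range(100):
    loss = ((plan(s0) - expert_actions) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the RL setting described in the abstract, the same unrolled planner would instead be trained with a policy gradient loss, so the dynamics model is updated to maximize task performance directly rather than one-step prediction accuracy, which is how the objective-mismatch issue is addressed.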
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1693