Data-efficient Hindsight Off-policy Option Learning

Markus Wulfmeier; Dushyant Rao; Roland Hafner; Thomas Lampe; Abbas Abdolmaleki; Tim Hertweck; Michael Neunert; Dhruva Tirumala; Noah Yamamoto Siegel; Nicolas Heess; Martin Riedmiller

28 Sept 2020 (modified: 05 May 2023), ICLR 2021 Conference Blind Submission, Readers: Everyone

Keywords: Hierarchical Reinforcement Learning, Off-Policy, Abstractions, Data-Efficiency

Abstract: Hierarchical approaches for reinforcement learning aim to improve data efficiency and accelerate learning by incorporating different abstractions. We introduce Hindsight Off-policy Options (HO2), an efficient off-policy option learning algorithm, and isolate the impact of action and temporal abstraction in the option framework by comparing flat policies, mixture policies without temporal abstraction, and finally option policies, all with comparable policy optimization. When aiming for data efficiency, we demonstrate the importance of off-policy optimization, as even flat policies trained off-policy can outperform on-policy option methods. In addition, off-policy training and backpropagation through a dynamic programming inference procedure (through time and through the policy components for every time-step) enable us to train all components' parameters independently of the data-generating behavior policy. We further illustrate challenges in off-policy option learning and the related importance of trust-region constraints. Experimentally, we demonstrate that HO2 outperforms existing option learning methods and that both action and temporal abstraction provide strong benefits, particularly in more demanding simulated robot manipulation tasks from raw pixel inputs. Finally, we develop an intuitive extension to encourage temporal abstraction and investigate differences in its impact between learning from scratch and using pre-trained options.

One-sentence Summary: We develop an efficient off-policy option learning method, isolate the impact of action and temporal abstraction, demonstrate the importance and challenges of off-policy learning and solve challenging tasks from raw pixels.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Reviewed Version (pdf): https://openreview.net/references/pdf?id=K0TquUikR_
