Off-policy Multi-step Q-learning

Gabriel Kalweit; Maria Huegle; Joschka Boedecker

25 Sept 2019 (modified: 12 Oct 2025); ICLR 2020 Conference Blind Submission; Readers: Everyone

Keywords: Multi-step Learning, Off-policy Learning, Q-learning

TL;DR: We estimate the full return in off-policy reinforcement learning by combining short- and long-term predictions.

Abstract: In the past few years, off-policy reinforcement learning methods have shown promising results in robot control. Deep Q-learning, however, still suffers from poor data-efficiency, which limits its use in real-world applications. We follow the idea of multi-step TD-learning to enhance data-efficiency while remaining off-policy by proposing two novel Temporal-Difference formulations: (1) Truncated Q-functions, which represent the return for the first n steps of a policy rollout, and (2) Shifted Q-functions, which represent the farsighted return after this truncated rollout. We prove that the combination of these short- and long-term predictions is a representation of the full return, leading to the Composite Q-learning algorithm. We show the efficacy of Composite Q-learning in the tabular case and compare our approach in the function-approximation setting with TD3, Model-based Value Expansion and TD3(Delta), which we introduce as an off-policy variant of TD(Delta). On three simulated robot tasks, Composite TD3 outperforms TD3 as well as state-of-the-art off-policy multi-step approaches in terms of data-efficiency.
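The decomposition described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' Composite Q-learning algorithm; it only checks numerically, for a fixed policy, that an n-step truncated return plus a shifted (discounted long-term) return equals the full return. The toy four-state cycle, the function names, and the choice n = 2 are all assumptions made for illustration, and each "state" here stands for a state-action pair under the fixed policy.

```python
# Illustrative sketch only: a deterministic toy chain under a fixed policy.
# All names and the toy problem are hypothetical, not taken from the paper.

def evaluate_full(next_state, reward, gamma, sweeps=2000):
    """Full return by iterating V(s) = r(s) + gamma * V(next(s))."""
    v = [0.0] * len(reward)
    for _ in range(sweeps):
        v = [reward[s] + gamma * v[next_state[s]] for s in range(len(reward))]
    return v

def truncated_return(next_state, reward, gamma, n):
    """Discounted reward of the first n steps, built from n one-step backups."""
    t = [0.0] * len(reward)
    for _ in range(n):
        t = [reward[s] + gamma * t[next_state[s]] for s in range(len(reward))]
    return t

def shifted_return(next_state, full, gamma, n):
    """Return collected after step n, discounted back to the current step."""
    sh = list(full)
    for _ in range(n):
        sh = [gamma * sh[next_state[s]] for s in range(len(full))]
    return sh

# Toy problem (assumed): a 4-state cycle with reward 1 when leaving state 3.
next_state = [1, 2, 3, 0]
reward = [0.0, 0.0, 0.0, 1.0]
gamma, n = 0.9, 2

full = evaluate_full(next_state, reward, gamma)
trunc = truncated_return(next_state, reward, gamma, n)
shift = shifted_return(next_state, full, gamma, n)

# The short-term and long-term parts should add up to the full return.
max_gap = max(abs(full[s] - (trunc[s] + shift[s])) for s in range(len(reward)))
print("largest gap between full and composite return:", max_gap)
```

In this sketch the truncated part depends only on the next n rewards, while the shifted part carries everything beyond that horizon.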

Code: https://gofile.io/?c=lmcyx5

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/off-policy-multi-step-q-learning/code)

