Greedy Multi-Step Off-Policy Reinforcement Learning

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission · Readers: Everyone
Keywords: off-policy learning, multi-step reinforcement learning, Q learning
Abstract: This paper presents a novel multi-step reinforcement learning algorithm, named Greedy Multi-Step Value Iteration (GM-VI), for the off-policy setting. GM-VI iteratively approximates the optimal value function of a given environment using a newly proposed multi-step bootstrapping technique, in which the bootstrapping step size is adaptively adjusted along each trajectory according to a greedy principle. With this improved multi-step information propagation mechanism, we show that the resulting VI process can safely learn from an arbitrary behavior policy without additional off-policy correction. We further analyze the theoretical properties of the corresponding operator, showing that it converges to the globally optimal value function at a rate faster than the traditional Bellman optimality operator. Experiments show that the proposed method is reliable, easy to implement, and achieves state-of-the-art performance on a series of standard benchmark tasks.
One-sentence Summary: A novel RL algorithm that allows efficient multi-step bootstrapping while safely learning from an arbitrary behavior policy without additional off-policy correction.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=5Anqt8m42I
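To make the greedy step-size selection described in the abstract concrete, here is a minimal sketch, not taken from the paper's code: it assumes the greedy principle picks, at each state along a trajectory, the bootstrapping horizon n that maximizes the n-step return bootstrapped with max_a Q, in which case the per-state maximum over all horizons collapses into a single backward recursion. The function name, trajectory layout, and terminal handling are illustrative assumptions.

```python
import numpy as np

def greedy_multistep_targets(rewards, bootstrap_values, gamma=0.99):
    """Greedy multi-step bootstrapping targets along one trajectory (sketch).

    rewards[t]          : reward received at step t, for t = 0..T-1
    bootstrap_values[t] : max_a Q(s_{t+1}, a) at the state reached after step t
                          (assumed 0.0 if that state is terminal)

    Returns targets[t] = max over n >= 1 of the n-step bootstrapped return
        sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * max_a Q(s_{t+n}, a),
    computed without an explicit max over n via the backward recursion
        y_t = r_t + gamma * max(max_a Q(s_{t+1}, a), y_{t+1}).
    """
    T = len(rewards)
    targets = np.empty(T)
    # At the final step only the 1-step target is available.
    targets[T - 1] = rewards[T - 1] + gamma * bootstrap_values[T - 1]
    for t in range(T - 2, -1, -1):
        # Greedily choose the better of bootstrapping immediately (1-step)
        # or extending the multi-step return one more step along the trajectory.
        targets[t] = rewards[t] + gamma * max(bootstrap_values[t], targets[t + 1])
    return targets
```

Under these assumptions the targets for a whole trajectory cost O(T) to compute, and since every candidate target bootstraps with max_a Q rather than the behavior policy's own action values, no importance-sampling ratios appear, which is consistent with the abstract's claim of safe learning from arbitrary behavior policies without additional off-policy correction.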