Adaptive N-step Bootstrapping with Off-policy Data

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission
Keywords: reinforcement learning, policy evaluation
Abstract: The definition of the update target is a crucial design choice in reinforcement learning. Due to their low computational cost and strong empirical performance, n-step returns computed from off-policy data are a widely used update target when bootstrapping from scratch. A critical issue in applying n-step returns is identifying the optimal value of n. In practice, n is often set to a fixed value, determined either by an empirical guess or by a hyper-parameter search. In this work, we point out that the optimal value of n actually differs for each data point, and a fixed n is a rough average of these values. The estimation error can be decomposed into two sources, off-policy bias and approximation error, and a fixed value of n trades off between them. Based on this observation, we introduce a new metric, policy age, to quantify the off-policyness of each data point. We propose Adaptive N-step Bootstrapping, which computes the value of n for each data point from its policy age instead of an empirical guess. We conduct experiments on both MuJoCo and Atari games. The results show that adaptive n-step bootstrapping achieves state-of-the-art performance in terms of both final reward and data efficiency.
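To make the idea concrete, below is a minimal Python sketch of an adaptive n-step return. It assumes policy age is measured as the number of policy updates since a transition was collected, and the linear mapping from policy age to n is a hypothetical illustration; the abstract does not specify the paper's exact rule, and the function name and parameters here are illustrative.

```python
def adaptive_n_step_return(rewards, values, policy_age, n_max=5, gamma=0.99):
    """Sketch: an n-step return whose n shrinks with the data's policy age.

    rewards[k]  -- reward at step t+k along the sampled trajectory
    values[k]   -- value estimate V(s_{t+k+1}) used for bootstrapping
    policy_age  -- policy updates since this data was collected
                   (assumed definition; the abstract does not pin it down)
    """
    # Hypothetical mapping (assumption): older, more off-policy data gets a
    # smaller n, trading approximation error against off-policy bias.
    n = max(1, n_max - policy_age)
    n = min(n, len(rewards))  # truncate to the available trajectory length

    # Standard n-step return: discounted reward sum plus a bootstrapped tail.
    g = sum(gamma ** k * rewards[k] for k in range(n))
    g += gamma ** n * values[n - 1]
    return g


# Usage: fresh data (age 0) bootstraps after n_max steps; stale data after 1.
rewards = [1.0, 0.5, 0.0, 2.0, 1.0]
values = [3.0, 2.5, 2.0, 1.5, 1.0]
print(adaptive_n_step_return(rewards, values, policy_age=0))  # n = 5
print(adaptive_n_step_return(rewards, values, policy_age=4))  # n = 1
```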
One-sentence Summary: A new adaptive n-step bootstrapping method that accelerates training, based on a detailed analysis of the working mechanism of n-step returns.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=IVANyrACKa