Abstract: Several low-bandwidth, distributable black-box optimization algorithms in the finite-differences family, such as Evolution Strategies, have recently been shown to perform nearly as well as tailored Reinforcement Learning methods in some Reinforcement Learning domains. One shortcoming of these black-box methods is that they must collect information about the structure of the return function at every update, and they can typically use only information drawn from a distribution centered on the current parameters. As a result, when these algorithms are distributed across many machines, a significant portion of the total runtime may be spent with many machines idle, waiting for the last return to arrive and then for an update to be computed. In this work we introduce a novel method for using older data in finite-difference algorithms, which produces a scalable algorithm that avoids significant idle time and wasted computation.
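For readers unfamiliar with the estimator class referenced above, the following is a minimal sketch of a synchronous, antithetic Evolution Strategies / finite-difference gradient estimate. It is not the delayed-information method introduced in the paper, and the names `episode_return`, `sigma`, and `num_pairs` are illustrative assumptions; it is included only to show why, in the standard setup, every update must wait on a fresh batch of evaluations centered on the current parameters.

```python
import numpy as np

def es_gradient_estimate(theta, episode_return, sigma=0.1, num_pairs=16, rng=None):
    """Sketch of a vanilla antithetic Evolution Strategies gradient estimate.

    `episode_return` evaluates the (possibly stochastic) return of a policy
    parameterized by the flat vector `theta`; `sigma` is the perturbation scale.
    All names here are illustrative, not the paper's interface.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_pairs):
        eps = rng.standard_normal(theta.shape)
        # Every evaluation below is drawn around the *current* parameters, so
        # the learner cannot form the update until the slowest one finishes --
        # the source of the idle time discussed in the abstract.
        r_plus = episode_return(theta + sigma * eps)
        r_minus = episode_return(theta - sigma * eps)
        grad += (r_plus - r_minus) * eps
    return grad / (2 * sigma * num_pairs)
```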
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- The portions of our paper relating to SGD and our modifications of it have been removed. The paper is now entirely about the inclusion of delayed information when computing updates to a policy with finite differences.
- We have conducted two new experiments: the first is an empirical study of how delayed information affects the quality of the final policy when trained with our method, across a range of delays and proportions of delayed information at each update; the second is a study of PPO under the same conditions and in the same environments in which we tested our method.
- We have substantially improved our discussion of the bias imparted to the finite differences gradient estimator by the inclusion of delayed information and moved it from Appendix C to the main body of the text, where it is now Section 5.
- We have added a Related Work section that discusses several similar and prior methods.
- The abstract and introduction now more clearly outline the problem we have addressed and our motivations for doing so.
- The $U_\text{max}$ parameter has been removed. The condition it guards against was never encountered during our experiments, and it was included only for settings in which workers may be unintentionally disconnected from the learner for an extended period.
- We have stopped using the word "return" when referring to information that workers transmit to the learner, instead saying "data" or "information" where appropriate.
- We have noted the shortcomings of our method in the Experiments, Discussion, and Conclusion sections.
- The Future Work section has been moved to the end of the Discussion.
- The text has been substantially changed to fit the new focus of the paper.
- Table 2 has been changed to show the mean and standard deviation of the performance of the policies trained with the methods we studied over the final 1 million time-steps of training in each environment and random seed, rather than the performance of the highest-performing policies in those settings.
Assigned Action Editor: ~Michael_Bowling1
Submission Number: 572