- Keywords: Reinforcement Learning, Machine Learning, Neural Networks, Policy Gradients, Continuous Control
- Abstract: Reproducibility Summary. Scope of Reproducibility: We attempt to reproduce the claim that Softmax Deep Double Deterministic Policy Gradient (SD3) achieves superior performance over Twin Delayed Deep Deterministic Policy Gradient (TD3) on continuous control reinforcement learning tasks. We use environments that appear in the original paper as well as some that do not. Methodology: We compare the performance of TD3 and SD3 on a variety of continuous control tasks. We use the authors' PyTorch code but also provide TensorFlow implementations of SD3 and TD3 (which, for optimization reasons, we did not use in the experiments). For the control tasks we use OpenAI Gym environments with PyBullet, as opposed to MuJoCo, in an effort to bolster claims of generalization and to avoid exclusionary research practices. Experiments are conducted both on environments similar to those in the original paper and on environments it did not mention. Results: Overall, we reach similar, albeit much milder, conclusions to those of the paper: SD3 outperforms TD3 on some continuous control tasks. However, the advantage is not always as readily apparent as in the original work. Algorithmic performance was comparable on most environments, with SD3 providing limited evidence of definitive superiority. Further investigation and improvements are warranted. The results are not directly comparable to the original paper due to differences in physics simulators. Additionally, we did not perform hyperparameter optimization, which could potentially bolster returns on some environments. What was easy: The authors made their code extremely easy to use, run, modify, and rewrite in a different package. Because everything was available on their GitHub and required only common reinforcement learning packages, it was quick and painless to run. It was trivial to use the algorithms on different environments from different packages and collect their results for analysis.
What was difficult: One of the biggest difficulties was the time and resource consumption of the experiments. Running each algorithm on each environment with a sufficient number of random seeds took the vast majority of the time. We had a total runtime of around 310 GPU hours (or 13 days). Time was our primary constraint and the main reason we did not investigate other environments. Simulator differences also proved to be somewhat challenging. Communication with original authors: Our contact with the authors was limited to a discussion we had at their poster presentation at NeurIPS 2020.
- Paper Url: https://openreview.net/forum?id=9C6L8mcTqZy