Abstract: Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and comparing them to an ever-changing set of standard algorithms. However, despite numerous calls for improvement, experimental practices continue to produce misleading or unsupported claims. One reason these substandard practices persist is that conducting rigorous benchmarking experiments requires substantial computation time. This work investigates the sources of increased computational costs in rigorous experiment designs. We show that the computational costs of rigorous performance benchmarking are often prohibitive. As a result, we question the value of performance evaluation as a primary experimentation tool and argue for a qualitatively different experimentation paradigm that can provide more insight from less computation. Furthermore, we strongly recommend that the community switch to this experimentation paradigm and encourage reviewers to adopt stricter standards for experiments.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Fixed a typo in Appendix A saying NeurIPS 2023 when it should say 2022.
Assigned Action Editor: ~Aleksandra_Faust1
Submission Number: 812