Keywords: reinforcement learning
Abstract: While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common practice, this choice is rarely discussed. In this paper, we present a case study on different variants of the Hopper environment to show that the selection of standard benchmarking suites is important for judging the performance of algorithms. To the best of our knowledge, no previous work has inspected whether these variants are interchangeable for the purposes of evaluating algorithms. We show that they are not, by comparing four representative algorithms on both variants. Our experimental results suggest a larger issue in the deep RL literature: benchmark choices are neither commonly justified, nor does there exist a language for justifying the selection of particular environments. The paper concludes with a discussion of the requirements for proper discussion and evaluation of benchmarks, and recommends steps to start a dialogue towards this goal.
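The abstract refers to multiple variants of the Hopper benchmark without naming them. As a minimal, hedged illustration (not necessarily the variants compared in the paper), the sketch below loads two commonly used Hopper implementations, Gymnasium's MuJoCo `Hopper-v4` and the dm_control suite's `hopper` domain, and prints their observation and action specifications, which already differ in shape and semantics.

```python
# Hedged illustration: load two common Hopper variants and compare their interfaces.
# The chosen variants are assumptions for demonstration; the paper may study others.
import gymnasium as gym
from dm_control import suite

# Variant 1: Gymnasium's MuJoCo Hopper.
gym_env = gym.make("Hopper-v4")
print("Gymnasium Hopper observation space:", gym_env.observation_space)  # Box with 11 dims
print("Gymnasium Hopper action space:", gym_env.action_space)            # Box with 3 dims

# Variant 2: dm_control's hopper domain, "hop" task.
dm_env = suite.load(domain_name="hopper", task_name="hop")
print("dm_control hopper observation spec:", dm_env.observation_spec())
print("dm_control hopper action spec:", dm_env.action_spec())

# Even at the interface level the variants differ (flat Box vs. dict-like spec,
# different state dimensionality and reward definitions), so scores obtained on
# one variant are not automatically comparable to scores obtained on the other.
```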
Submission Number: 12