Can we hop in general? A discussion of benchmark selection and design using the Hopper environment

Published: 07 Jun 2024, Last Modified: 09 Aug 2024
Venue: RLC 2024 ICBINB Poster
License: CC BY 4.0
Keywords: reinforcement learning
Abstract: While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common practice, this choice is rarely discussed. In this paper, we present a case study on different variants of the Hopper environment to showcase that the selection of standard benchmarking suites is important for judging the performance of algorithms. To the best of our knowledge, no previous work has inspected whether these different variants are interchangeable for the purposes of evaluating algorithms. We show that they are not, by comparing four representative algorithms on both variants. Our experimental results suggest a larger issue in the deep RL literature: benchmark choices are neither commonly justified, nor does there exist a language that could be used to justify the selection of certain environments. This paper concludes with a discussion of the requirements for proper discussion and evaluation of benchmarks and recommends steps to start a dialogue towards this goal.
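The abstract does not name the specific Hopper variants compared; a minimal sketch below illustrates the kind of difference at stake, assuming the two most common implementations: the Gymnasium MuJoCo Hopper and the dm_control hopper task. These two environments differ in observation layout, action specification, and termination behavior, which is exactly the sort of mismatch that can make results non-interchangeable.

```python
# A minimal sketch (not from the paper) contrasting two common "Hopper"
# variants. Assumes `gymnasium[mujoco]` and `dm_control` are installed.
import gymnasium as gym
from dm_control import suite

# Gymnasium variant: flat observation vector; episodes terminate
# early when the hopper falls over (an "unhealthy" state).
gym_env = gym.make("Hopper-v4")
obs, info = gym_env.reset(seed=0)
print("Gymnasium obs shape:", obs.shape)            # 11-dimensional by default
print("Gymnasium action space:", gym_env.action_space)

# dm_control variant: dict-structured observations and fixed-length
# episodes; no early termination on falling.
dm_env = suite.load(domain_name="hopper", task_name="hop")
time_step = dm_env.reset()
print("dm_control obs keys:", list(time_step.observation.keys()))
print("dm_control action spec:", dm_env.action_spec())
```

Even this surface-level comparison shows that an algorithm tuned on one variant faces a differently shaped problem on the other, motivating the paper's argument that benchmark choices deserve explicit justification.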
Submission Number: 12