Can we hop in general? A discussion of benchmark selection and design using the Hopper environment

Published: 04 Jun 2024 · Last Modified: 19 Jul 2024 · Finding the Frame: RLC 2024 Poster · CC BY 4.0
Keywords: benchmark evaluation, philosophy of RL
TL;DR: We show, using the Hopper environment, that common test benchmarks for RL are not representative of intuitive problem classes.
Abstract: While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common practice, this choice is rarely discussed. In this paper, we present a case study on different variants of the Hopper environment to show that the selection of a standard benchmarking suite matters when judging the performance of algorithms. To the best of our knowledge, no previous work has inspected whether these different variants are interchangeable for the purposes of evaluating algorithms. We show that this is not the case by comparing four representative algorithms on both variants. Our experimental results suggest a larger issue in the deep RL literature: benchmark choices are neither commonly justified, nor does there exist a language that could be used to justify the selection of certain environments. This paper concludes with a discussion of the requirements for proper discussion and evaluation of benchmarks and recommends steps to start a dialogue towards this goal.
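The abstract does not name the specific Hopper variants compared; as a hedged illustration of why such variants are not interchangeable, the sketch below loads the two most common off-the-shelf implementations (Gymnasium's MuJoCo Hopper and the dm_control hopper domain, assumed here purely for illustration) and prints their observation interfaces, which already differ in shape and structure.

```python
# Hedged sketch: load two common off-the-shelf Hopper variants and compare
# their observation interfaces. The variants used in the paper are an
# assumption; Gymnasium and dm_control are chosen only as familiar examples.
import gymnasium as gym
import numpy as np
from dm_control import suite

# Gymnasium (MuJoCo) variant: flat 11-dimensional observation vector.
gym_env = gym.make("Hopper-v4")
obs, info = gym_env.reset(seed=0)
print("Gymnasium Hopper-v4 observation shape:", obs.shape)

# dm_control variant: dict-valued observation (position, velocity, touch),
# with rewards bounded in [0, 1] rather than unbounded locomotion reward.
dm_env = suite.load(domain_name="hopper", task_name="hop", task_kwargs={"random": 0})
time_step = dm_env.reset()
dm_obs = np.concatenate([v.ravel() for v in time_step.observation.values()])
print("dm_control hopper observation size:", dm_obs.shape)
```

Even before training, differences such as observation layout and reward scaling mean that results on one variant need not transfer to the other, which is the kind of gap the paper's comparison examines.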
Submission Number: 22