Abstract: Thorough evaluation of the performance of reinforcement learning agents is critical to establishing significant progress in the field, with benchmarks being the key component of this process. In the tabular setting, a rich theory of environment hardness has recently been leveraged to design benchmarks with precise characterizations of hardness. In contrast, the non-tabular setting currently lacks such a theory and instead relies on expert judgments and community popularity to establish benchmarks. This reliance on subjective assessments can limit the rigour and reliability of the evaluation process. The goal of this paper is to take the first step towards the design of principled non-tabular benchmarks through four main contributions. First, we review the theory of hardness in the tabular and non-tabular settings to highlight promising directions. Second, we identify the essential features that a principled benchmarking library for non-tabular reinforcement learning should possess and explain the limitations of existing libraries in meeting those needs. Third, we propose a new library (pharos) specifically designed to support the development of principled benchmarking. Finally, we present an in-depth case study that, in addition to illustrating the kind of analysis that pharos facilitates, demonstrates that, although tabular measures can serve as one component in quantifying non-tabular hardness, it is necessary to develop measures tailored to the non-tabular setting.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sebastian_Tschiatschek1
Submission Number: 3484