Improving Reproducibility of Benchmarks
Abstract: Benchmarks are at the core of research in deep learning. They are the most common method for measuring the importance of new contributions. Competitions in the form of benchmarks with leaderboards have been systematized by organisations such as Kaggle, and academic results are registered to facilitate comparisons in papers [2]. There have been growing concerns about reproducibility lately [3, 5, 6, 7, 8, 9, 10]. This may seem unrelated to benchmarking, but we argue that the latter is at the very core of the reproducibility issue in the field of deep learning. The common answer to reproducibility issues is code and data sharing, in other words a better description of the specifications [2, 4]. Recent works have, however, raised issues that cannot be addressed with improved communication of specifications [3, 5, 7, 8, 9]. As argued in [1], the core of the issue revolves around the experiment design, which in most cases does not satisfy inferential reproducibility, that is, it cannot provide corroboration of conclusions. As introduced above, most works in the deep learning literature rely on benchmarking to measure the importance of a work's contribution, making it the foundation of the experiment design used. However, benchmarks can be very sensitive to sources of variation and can therefore produce unreproducible leaderboards. We used the data provided in [1] and conducted simulations to illustrate how standard benchmarks of deep learning models tend to have a poor rate of reproduction. In response, we suggest incorporating statistical testing tools into the design of benchmarks to improve their reproducibility.
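To make the failure mode concrete, the sketch below is a toy simulation, not the paper's actual protocol and not the data from [1]: it estimates how often a single-run benchmark reproduces its own leaderboard winner when model scores fluctuate across random seeds or data splits. The number of models, their mean accuracies, and the noise level are all illustrative assumptions.

```python
# Toy sketch (illustrative assumptions only): estimate how often a single-run
# benchmark reproduces its own top-ranked model when scores vary across seeds.
import numpy as np

rng = np.random.default_rng(0)

n_models = 10        # hypothetical number of competing models
true_means = rng.normal(loc=0.75, scale=0.01, size=n_models)  # close "true" accuracies
seed_noise = 0.02    # assumed std. dev. of score variation across seeds/splits
n_trials = 10_000    # number of simulated benchmark replications

same_winner = 0
for _ in range(n_trials):
    # Two independent runs of the same benchmark, one observed score per model each.
    run_a = true_means + rng.normal(scale=seed_noise, size=n_models)
    run_b = true_means + rng.normal(scale=seed_noise, size=n_models)
    same_winner += int(np.argmax(run_a) == np.argmax(run_b))

print(f"Estimated top-1 reproduction rate: {same_winner / n_trials:.2f}")
```

When the gaps between models are small relative to the seed-to-seed variation, the two simulated runs frequently disagree on the winner, which is the kind of poor reproduction rate the abstract refers to; averaging over multiple runs and comparing models with statistical tests is the type of remedy the paper advocates.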