How not to Lie with a Benchmark: Rearranging NLP Leaderboards

21 May 2021 (modified: 05 May 2023) · NeurIPS 2021 Submitted · Readers: Everyone
Keywords: benchmarking, human evaluation, NLP benchmarks, GLUE, average, model evaluation
TL;DR: The arithmetic mean is not an appropriate way to aggregate scores on NLP benchmarks. Under more suitable averaging methods, humans still rank first.
Abstract: Proper model ranking and comparison with the human level is an essential requirement for every benchmark to be a reliable measurement of model quality. Nevertheless, the standard method of model comparison can have a fundamental flaw: the arithmetic mean of separate metrics is used across tasks of different complexity and with different test and training set sizes. In this paper, we examine the overall scoring methods of popular NLP benchmarks and rearrange the models by geometric and harmonic mean (appropriate for averaging rates) according to their reported results. We analyze several popular benchmarks, including GLUE, SuperGLUE, XGLUE, and XTREME. The analysis shows, for example, that the human level on SuperGLUE has still not been reached, and there is still room for improvement for current models.
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
Supplementary Material: pdf
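
To make the aggregation issue concrete, here is a minimal Python sketch (not from the paper) showing how the choice of arithmetic, geometric, or harmonic mean can change a leaderboard ranking. The per-task scores below are purely hypothetical and only illustrate the effect the abstract describes.

```python
import statistics

# Hypothetical per-task scores (rates in [0, 1]) for two models;
# the numbers are illustrative, not taken from any real leaderboard.
scores = {
    "model_a": [0.95, 0.90, 0.55],   # strong on two tasks, weak on one
    "model_b": [0.82, 0.80, 0.78],   # uniformly solid
}

def aggregate(values):
    """Compute the three aggregation schemes compared in the paper."""
    return {
        "arithmetic": statistics.fmean(values),
        "geometric": statistics.geometric_mean(values),
        "harmonic": statistics.harmonic_mean(values),
    }

for name, vals in scores.items():
    agg = aggregate(vals)
    print(name, {k: round(v, 3) for k, v in agg.items()})

# Output (rounded):
# model_a {'arithmetic': 0.8, 'geometric': 0.778, 'harmonic': 0.753}
# model_b {'arithmetic': 0.8, 'geometric': 0.8, 'harmonic': 0.8}
```

Under the arithmetic mean the two models tie, while the geometric and harmonic means penalize model_a's weak task and rank model_b first, which is the kind of reordering the paper reports for GLUE-style leaderboards.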