Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning

Thomas Liao; Rohan Taori; Inioluwa Deborah Raji; Ludwig Schmidt

Published: 11 Oct 2021, Last Modified: 23 May 2023

Venue: NeurIPS 2021 Datasets and Benchmarks Track (Round 2)

Readers: Everyone

Keywords: evaluation, progress, benchmarks, meta-survey, meta-review, validity, transfer

TL;DR: We present a meta-review of evaluation failures across subfields of machine learning, finding surprisingly consistent failure modes.

Abstract: Many subfields of machine learning share a common stumbling block: evaluation. Advances in machine learning often evaporate under closer scrutiny or turn out to be less widely applicable than originally hoped. We conduct a meta-review of 107 survey papers from natural language processing, recommender systems, computer vision, reinforcement learning, computational biology, graph learning, and more, organizing the wide range of surprisingly consistent critique into a concrete taxonomy of observed failure modes. Inspired by measurement and evaluation theory, we divide failure modes into two categories: internal and external validity. Internal validity issues pertain to evaluation on a learning problem in isolation, such as improper comparisons to baselines or overfitting from test set re-use. External validity relies on relationships between different learning problems, for instance, whether progress on a learning problem translates to progress on seemingly related tasks.

Supplementary Material: pdf

URL: https://github.com/tholiao/are_we_learning_yet
