Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

ICLR 2024 Workshop DMLR Submission 97 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024, DMLR @ ICLR 2024, CC BY 4.0
Keywords: benchmarking, efficient model evaluation, dynamic benchmarks
TL;DR: This work introduces Lifelong Benchmarks and addresses the challenge of spiraling evaluation cost through an efficient evaluation framework called Sort & Search (S&S).
Abstract: Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. To mitigate this, we compile \textit{ever-expanding} large-scale benchmarks called \textit{Lifelong Benchmarks}---as exemplars, we create \textit{Lifelong-CIFAR10} and \textit{Lifelong-ImageNet}, containing (for now) 1.69M and 1.98M test samples, respectively. While reducing overfitting, lifelong benchmarks introduce a key challenge: the high cost of evaluating a growing number of models across an ever-expanding sample set. To address this, we also introduce an efficient evaluation framework: \textit{Sort \& Search (S\&S)}, which leverages dynamic programming algorithms to selectively rank and sub-select test samples. Extensive empirical evaluations on 31,000 models demonstrate that \textit{S\&S} achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours ($\sim$1000x reduction) on a single A100 GPU.
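Illustrative sketch (not part of the submission): the core idea behind Sort & Search, as described in the abstract, is to rank test samples by difficulty using the correctness records of previously evaluated models, and then to estimate a new model's accuracy from a small evaluated subset by fitting a single threshold on the ranked sample axis. The minimal NumPy sketch below is an assumption-laden rendition of that idea; the function names, the subset-selection scheme, and the brute-force threshold fit (which stands in for the paper's dynamic-programming procedure) are illustrative, not the authors' implementation.

# Minimal sketch of the Sort & Search idea from the abstract; names and the
# threshold-fitting step are illustrative assumptions, not the paper's code.
import numpy as np

def sort_samples_by_difficulty(correctness: np.ndarray) -> np.ndarray:
    """Rank test samples from easiest to hardest.

    `correctness` is an (n_models, n_samples) binary matrix recording whether
    each previously evaluated model answered each sample correctly.
    """
    ease = correctness.mean(axis=0)     # fraction of models correct per sample
    return np.argsort(-ease)            # easiest samples first

def estimate_accuracy(subset_correct: np.ndarray,
                      subset_positions: np.ndarray,
                      n_samples: int) -> float:
    """Fit a single threshold k on the sorted axis: the new model is assumed
    correct on the k easiest samples and wrong on the rest. The k that best
    agrees with the evaluated subset gives the predicted accuracy k / n.
    (The paper uses a dynamic-programming/search procedure; this brute-force
    scan over candidate thresholds is only for illustration.)
    """
    candidates = np.arange(n_samples + 1)
    # For each candidate threshold, count disagreements on the evaluated subset.
    predicted = (subset_positions[None, :] < candidates[:, None]).astype(int)
    errors = np.abs(predicted - subset_correct[None, :]).sum(axis=1)
    best_k = int(candidates[errors.argmin()])
    return best_k / n_samples

# Toy usage: 100 source models, 10,000 samples, a budget of 200 evaluations.
rng = np.random.default_rng(0)
n = 10_000
difficulty = np.linspace(0.9, 0.1, n)                      # per-sample P(correct)
source_correct = rng.random((100, n)) < difficulty
order = sort_samples_by_difficulty(source_correct)         # easiest -> hardest

new_model_full = rng.random(n) < difficulty                # toy "new model" labels
budget_positions = np.linspace(0, n - 1, 200).astype(int)  # spread over the ranking
subset_correct = new_model_full[order][budget_positions].astype(int)

est = estimate_accuracy(subset_correct, budget_positions, n)
print(f"true accuracy {new_model_full.mean():.3f} | estimated {est:.3f}")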
Primary Subject Area: Data collection and benchmarking techniques
Paper Type: Research paper: up to 8 pages
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 97