Keywords: multi-modal, benchmarks, machine learning, model evaluation, benchmark frameworks
TL;DR: EEVEE optimizes model evaluation by identifying small benchmark subsets with high predictive power over a diverse full suite. In our experiments it finds Pareto-optimal sets such as iWildCam, CLEVR-Math, ACDC, and WinoGround. These are packaged into GATE, a model encoder evaluation engine.
Abstract: Model evaluation is a cornerstone of machine learning, guiding model design and progress measurement. Designing generalizable evaluation processes remains a challenge, however, partly because of the vast number of possible domain, task, and modality combinations and the lack of knowledge about how informative each is. In this paper, we propose EEVEE (Efficient Evaluation process Evolution Engine) - pronounced as \textipa{/'i:vi:/} EE-vee - a method that frames evaluation process design as a learning problem. By analyzing a large number of evaluation metrics from diverse benchmarks and models, EEVEE identifies a smaller subset of tasks with high predictive power over the full set of evaluation metrics, reducing evaluation time. To find the optimal subset that maximizes signal while minimizing GPU hours, EEVEE evaluates pre-trained models of various architectures, pre-training schemes, and modalities on diverse downstream tasks and datasets, including image classification, segmentation, relational reasoning, zero-shot image-to-text tasks, medical classification and segmentation, video classification, and regression. Our results identify three subsets of benchmarks, with 8, 15, and 21 tasks, that provide a high-quality signal of model generalization. Key benchmarks selected include iWildCam, CLEVR-Math, ACDC, WinoGround, CIFAR100, Fungi, and ADE20K. We structure the subsets into three tiers for 12, 24, and 36 GPU-hour budgets and package them into a unified, efficient, and user-friendly Python framework built with the researcher in mind -- which we refer to as the GATE engine.
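To make the selection idea concrete, below is a minimal sketch of budget-constrained benchmark subset selection in the spirit the abstract describes. The paper does not specify the search strategy or the predictor, so this sketch assumes greedy forward selection with a cross-validated ridge-regression proxy for predictive power; all names (`metrics`, `gpu_hours`, `budget`) are illustrative and are not the GATE API.

```python
# Hypothetical sketch of EEVEE-style subset selection: greedily grow a set of
# benchmark tasks that best predicts the held-out tasks' metrics, subject to
# a GPU-hour budget. Not the authors' implementation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def predictive_power(metrics: np.ndarray, subset: list[int]) -> float:
    """Mean cross-validated R^2 when the subset's columns predict each
    remaining task's metric. `metrics` has shape (n_models, n_tasks)."""
    rest = [t for t in range(metrics.shape[1]) if t not in subset]
    if not subset or not rest:
        return 0.0
    X = metrics[:, subset]
    scores = [
        cross_val_score(Ridge(alpha=1.0), X, metrics[:, t], cv=3).mean()
        for t in rest
    ]
    return float(np.mean(scores))


def select_subset(metrics: np.ndarray, gpu_hours: np.ndarray,
                  budget: float) -> list[int]:
    """Greedily add the task that most improves predictive power while the
    subset's total GPU-hour cost stays within `budget`."""
    subset: list[int] = []
    while True:
        used = sum(gpu_hours[s] for s in subset)
        candidates = [t for t in range(metrics.shape[1])
                      if t not in subset and used + gpu_hours[t] <= budget]
        if not candidates:
            return subset
        best = max(candidates,
                   key=lambda t: predictive_power(metrics, subset + [t]))
        subset.append(best)
```

Running this three times with budgets of 12, 24, and 36 GPU hours would yield three nested evaluation tiers of the kind the abstract reports, though the actual tier construction in the paper may differ.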
Our experiments reveal ConvNextV2, SigLIP, and CLIP as the top-performing model encoders, with EfficientNetV2 and ResNeXt50 excelling in medical tasks and challenging image classification. In particular, on Happy Whale individual classification, ConvNet-based models outperform transformer models by a surprising factor of 2.5x. The finding that ConvNextV2 is the top-performing encoder, followed by CLIP, agrees with other recent large-scale evaluations. We also demonstrate the framework's versatility by fine-tuning models from the text and audio modalities, paving the way for future cross-modal evaluations.
Submission Number: 543