Trustworthy model evaluation on a budget

Published: 16 Apr 2023, Last Modified: 04 May 2023 · RTML Workshop 2023
Keywords: Explainability, Trustworthy, Model Evaluation, Large-Scale
TL;DR: Current machine learning practices for ablation experiments can lead to unreliable conclusions when hyperparameter selection and the computational budget have strong interaction effects.
Abstract: Standard practice in Machine Learning (ML) research uses ablation studies to evaluate a novel method. We find that errors in the ablation setup can lead to incorrect explanations of which method components contribute to performance. Previous work has shown that the majority of experiments published at top conferences are performed with few experimental trials (fewer than 50) and manual sampling of hyperparameters. Using the insights from our meta-analysis, we demonstrate how current practices can lead to unreliable conclusions. We simulate an ablation study on an existing Neural Architecture Search (NAS) benchmark and perform an ablation study with 120 trials using ResNet50. We quantify the selection bias of Hyperparameter Optimization (HPO) strategies to show that only random sampling can produce reliable results when determining the top and mean performance of a method under a limited computational budget.
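To make the abstract's claim about selection bias concrete, the following is a minimal, self-contained sketch (not the authors' code, and not tied to the NAS benchmark or ResNet50 setup described above). It assumes a toy 1-D hyperparameter with a hypothetical score function `true_score` and a made-up "greedy" adaptive HPO strategy, and shows how reporting top and mean performance under a small trial budget depends on whether hyperparameters are sampled at random or adaptively.

```python
# Minimal illustrative sketch (assumptions, not the paper's method):
# Monte Carlo comparison of how random vs. adaptive hyperparameter sampling
# interacts with a small trial budget when estimating "top" and "mean" scores.
import numpy as np

rng = np.random.default_rng(0)

def true_score(hp):
    """Hypothetical validation accuracy for a 1-D hyperparameter,
    with per-trial noise standing in for seed/initialization variance."""
    return 0.90 - 0.05 * (hp - 0.3) ** 2 + rng.normal(0.0, 0.01)

def random_sampling(budget):
    """Sample hyperparameters uniformly at random within the search space."""
    return [true_score(rng.uniform(0.0, 1.0)) for _ in range(budget)]

def greedy_hpo(budget):
    """Toy adaptive strategy: after a few random trials, keep re-sampling
    near the best configuration seen so far (introduces selection bias)."""
    hps = [rng.uniform(0.0, 1.0) for _ in range(5)]
    scores = [true_score(hp) for hp in hps]
    for _ in range(budget - 5):
        best_hp = hps[int(np.argmax(scores))]
        hp = float(np.clip(best_hp + rng.normal(0.0, 0.05), 0.0, 1.0))
        hps.append(hp)
        scores.append(true_score(hp))
    return scores

budget = 20  # small budget, mimicking the "fewer than 50 trials" regime
for name, strategy in [("random", random_sampling), ("greedy", greedy_hpo)]:
    tops, means = [], []
    for _ in range(500):  # repeat the whole experiment to expose the variance
        scores = strategy(budget)
        tops.append(max(scores))
        means.append(np.mean(scores))
    print(f"{name:7s} top:  {np.mean(tops):.4f} +/- {np.std(tops):.4f}")
    print(f"{name:7s} mean: {np.mean(means):.4f} +/- {np.std(means):.4f}")
```

Under these assumptions, the adaptive strategy concentrates trials near its best observed configuration, so its reported mean performance is no longer an unbiased estimate over the search space, whereas uniform random sampling yields comparable top scores with a stable mean, illustrating the kind of strategy-budget interaction the abstract refers to.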