Abstract: The evaluation of machine learning (ML) models is a core requirement for their trustworthy use. Evaluation is typically performed on a held-out dataset. However, such validation datasets often need to be large and are hard to procure; further, multiple models may perform equally well on such sets. To address these challenges, we offer GeValdi: an efficient method to validate discriminative classifiers by creating samples on which such classifiers maximally differ. We demonstrate how such ``maximally different samples'' can be constructed and leveraged to probe the failure modes of classifiers, and offer a hierarchically-aware metric to further support fine-grained, comparative model evaluation.
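The abstract does not specify how the maximally different samples are constructed; the sketch below is a hypothetical illustration (not the GeValdi procedure itself) of one way to search for inputs on which two classifiers maximally disagree, assuming PyTorch models that output class logits and a differentiable input.

```python
import torch
import torch.nn.functional as F

def maximally_different_sample(model_a, model_b, x_init, steps=200, lr=0.05):
    """Illustrative sketch: perturb an input by gradient ascent so that the two
    classifiers' predictive distributions diverge as much as possible
    (measured here by a symmetric KL divergence)."""
    x = x_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        log_p_a = F.log_softmax(model_a(x), dim=-1)
        log_p_b = F.log_softmax(model_b(x), dim=-1)
        # Symmetric KL between the two predictive distributions; maximizing it
        # drives x toward regions where the classifiers maximally disagree.
        divergence = (
            F.kl_div(log_p_a, log_p_b, log_target=True, reduction="batchmean")
            + F.kl_div(log_p_b, log_p_a, log_target=True, reduction="batchmean")
        )
        loss = -divergence  # minimize the negative, i.e. ascend on divergence
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return x.detach()
```

Inputs returned this way can then be inspected to probe where and how the two classifiers' failure modes differ; the divergence objective, optimizer, and step budget here are illustrative choices, not details taken from the paper.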