Model validation failure in class imbalance problems

Seokho Kang

2020 (modified: 04 Oct 2020) | Expert Syst. Appl. 2020 | Readers: Everyone

Highlights:
• Model validation is inherently difficult under class imbalance when the minority class is rare in an absolute sense.
• Validation performance can misrepresent the generalization ability of classification models.
• Random guessing models can yield considerably high validation performance by chance.
• A higher degree of absolute rarity increases the likelihood of model validation failure.

Abstract: For a classification task, multiple classification models can be built from the training set in various ways. In general, the best-performing model is selected for deployment through a model validation procedure. However, even if the dataset is sufficiently large, model validation is difficult when the minority class is too rare in an absolute sense in the validation set. Under such an extreme absolute rarity condition, the validation performance of a model is more strongly affected by randomness in the model, so it can give a misleading estimate of the model's generalization ability. In this regard, even a random guessing model, which will eventually fail to accurately classify new data, can yield a considerably high validation performance by chance. This implies that the selected model may not perform well during its deployment. In this study, the effect of absolute rarity on the inherent difficulty of model validation is investigated. We demonstrate that both a higher degree of absolute rarity in the validation set and the comparison of a larger number of models during model validation contribute to an increased likelihood of model validation failure. Finally, a practical guideline is suggested to evaluate model validation results.
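The claim that a random guessing model can reach a high validation score by chance can be illustrated with a small simulation. The sketch below is not the paper's experimental protocol; it is a minimal illustration under stated assumptions: AUC is used as the validation metric, each candidate model assigns uniform random scores, and the function names (`auc_from_scores`, `best_random_auc`, `mean_best_auc`) and the set sizes are hypothetical choices. Every candidate has a true AUC of 0.5, so any gap between the best validation AUC and 0.5 is due to chance alone.

```python
import numpy as np


def auc_from_scores(scores, labels):
    """AUC computed from the rank-sum (Mann-Whitney U) statistic.

    Assumes continuous scores, so ties occur with negligible probability.
    """
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def best_random_auc(n_pos, n_neg, n_models, rng):
    """Best validation AUC among n_models random guessing models."""
    labels = np.concatenate([np.ones(n_pos, dtype=int),
                             np.zeros(n_neg, dtype=int)])
    # Each "model" scores the validation set with uniform random numbers.
    aucs = [auc_from_scores(rng.random(labels.size), labels)
            for _ in range(n_models)]
    return max(aucs)


def mean_best_auc(n_pos, n_neg, n_models, n_trials=200, seed=0):
    """Average, over repeated trials, of the selected (best) validation AUC."""
    rng = np.random.default_rng(seed)
    best = [best_random_auc(n_pos, n_neg, n_models, rng)
            for _ in range(n_trials)]
    return float(np.mean(best))


if __name__ == "__main__":
    # Hypothetical settings: 1000 majority examples, few or many minority ones.
    for n_pos in (5, 50):
        for n_models in (1, 100):
            score = mean_best_auc(n_pos, 1000, n_models)
            print(f"minority={n_pos:3d} models={n_models:3d} "
                  f"mean best AUC={score:.3f}")
```

Under these assumptions, the mean best AUC should move further above 0.5 as the number of minority examples shrinks and as more candidates are compared, which is the pattern the abstract describes.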
