Improving Evaluation of Facial Attribute Prediction Models

Bryson Lingenfelter, Emily M. Hand

2021 (modified: 05 Oct 2022), FG 2021, Readers: Everyone

Abstract: CelebA is the largest-scale and most commonly used dataset for evaluating facial attribute prediction methods, and it serves as an important benchmark in imbalanced classification and face analysis. However, we argue that the evaluation metrics and baseline models currently used to compare methods are insufficient for determining which approaches are best at classifying highly imbalanced attributes. We obtain results comparable to the current state of the art using a ResNet-18 model trained with binary cross-entropy, a substantially less sophisticated approach than those in related work. We also show that we can obtain near-state-of-the-art accuracy using a model trained on just 10% of CelebA, and near-state-of-the-art balanced accuracy simply by maximizing recall for imbalanced attributes at the expense of all other metrics. To address these issues, we suggest several improvements to model evaluation, including better metrics, stronger baselines, and increased awareness of the limitations of the dataset.
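
The point about balanced accuracy can be illustrated with a small sketch. The confusion-matrix counts below are invented for illustration only (a hypothetical attribute with 50 positives in 1,000 images) and are not taken from the paper or from CelebA; the `Confusion` class and the two classifier names are likewise hypothetical. The sketch compares a conservative classifier against one that trades precision for recall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts for a single attribute."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def recall(self) -> float:
        # True positive rate.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # True negative rate.
        return self.tn / (self.tn + self.fp)

    @property
    def balanced_accuracy(self) -> float:
        # Mean of recall on the positive class and recall on the negative class.
        return (self.recall + self.specificity) / 2

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def f1(self) -> float:
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Illustrative, made-up counts: 50 positives and 950 negatives in both cases.
conservative = Confusion(tp=30, fp=10, fn=20, tn=940)
recall_heavy = Confusion(tp=48, fp=190, fn=2, tn=760)

for name, c in [("conservative", conservative), ("recall_heavy", recall_heavy)]:
    print(
        f"{name:>12}: acc={c.accuracy:.3f} bal_acc={c.balanced_accuracy:.3f} "
        f"precision={c.precision:.3f} recall={c.recall:.3f} f1={c.f1:.3f}"
    )
```

Under these assumed counts, the recall-heavy classifier has the higher balanced accuracy (about 0.88 versus 0.79) even though its accuracy, precision, and F1 are all clearly worse, which is the kind of effect the abstract warns about when balanced accuracy is reported alone.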
