Abstract: The presence of noisy annotations in large-scale facial expression datasets has been a key challenge to facial expression recognition (FER) performance in the wild. Convolutional neural networks tend to fit clean data in the early stages of training; however, they memorize noisy labels as training progresses, which is detrimental to performance. Our proposed architecture consists of a pair of classifier networks (CNs) and an instance discrimination network (IDN), all built on top of a shared base network. The IDN is designed to identify each image instance in the training dataset. We inventively use the IDN both to promote feature learning and to regularize our model. We design a three-phase training methodology that protects the model from overfitting noisy labels as training progresses while still utilizing all the training samples. (1) A warmup phase initially co-trains the pair of CNs and the shared base network purely on a supervision loss defined over all samples. (2) A filtration phase partitions the training dataset into clarified and messy samples based on the predictions of the CNs and the ground-truth labels. (3) A noise-robust training phase uses a joint loss consisting of a supervision loss over clarified samples, a consistency loss over messy samples, and a loss contribution from the IDN over all samples. We thus carefully avoid supervision over messy samples; instead, the IDN supplements the consistency loss in learning features from them. We demonstrate the effectiveness of our method on standard FER benchmark datasets as well as on synthetic noisy label datasets. Code is available at https://github.com/gnvikas/NoisyFER.
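
Below is a minimal sketch (not the authors' released code) of the phase-3 joint loss described in the abstract, assuming a PyTorch setup; the names `base`, `clf1`, `clf2`, `idn`, `clean_mask`, `lambda_cons`, and `lambda_idn`, as well as the symmetric-KL form of the consistency term, are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_loss(base, clf1, clf2, idn, images, labels, instance_ids, clean_mask,
               lambda_cons=1.0, lambda_idn=1.0):
    """Supervision on clarified samples, consistency on messy samples,
    and instance discrimination on all samples (hypothetical weighting)."""
    feats = base(images)                       # shared base network features
    logits1, logits2 = clf1(feats), clf2(feats)

    # (a) Supervision loss: cross-entropy for both classifiers,
    #     restricted to samples flagged as clarified (clean_mask == True).
    sup = (F.cross_entropy(logits1[clean_mask], labels[clean_mask]) +
           F.cross_entropy(logits2[clean_mask], labels[clean_mask]))

    # (b) Consistency loss over messy samples: encourage the two classifiers
    #     to agree (symmetric KL between their predictive distributions,
    #     one common choice; the paper may use a different form).
    messy = ~clean_mask
    p1 = F.log_softmax(logits1[messy], dim=1)
    p2 = F.log_softmax(logits2[messy], dim=1)
    cons = 0.5 * (F.kl_div(p1, p2.exp(), reduction="batchmean") +
                  F.kl_div(p2, p1.exp(), reduction="batchmean"))

    # (c) Instance discrimination loss: classify every image into its own
    #     instance id, computed over all samples regardless of label quality.
    inst = F.cross_entropy(idn(feats), instance_ids)

    return sup + lambda_cons * cons + lambda_idn * inst
```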