Minimizing Chebyshev Risk Magically Mitigates the Perils of Overfitting

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: prototype, regularization, overfitting, Chebyshev
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Reduced overfitting via loss terms based on a Chebyshev bound involving within-class covariance and between-class prototype distance
Abstract: Since reducing overfitting in deep neural networks (DNNs) increases their test performance, many efforts have tried to mitigate it by adding regularization loss terms to one or more hidden layers of the network, including the convolutional layers. To build on the intuition guiding these previous works, we analytically study how intra- and inter-class feature relationships affect misclassification. Our analysis begins by viewing a DNN as the composition of a feature extractor and a classifier, where the classifier is the last fully connected layer of the network and the feature layer provides the input vector to that classifier. We assume that, for each class, there exists an ideal feature vector, which we designate as the class prototype. The goal of our training method is then to reduce the probability that an example's features deviate significantly from its class prototype, since such deviation increases the risk of misclassification. Formally, this probability can be bounded using Chebyshev's inequality in terms of the within-class covariance and the between-class prototype distance. The terms of this bound are added to our loss function for optimizing the feature layer, which in turn implicitly optimizes the parameters of the preceding convolutional layers. Empirical results on multiple datasets and network architectures show that our training algorithm reduces overfitting and improves upon previous approaches in an efficient manner.
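The abstract describes the loss only verbally; the sketch below is a minimal illustration, not the authors' released code, of how such a Chebyshev-style penalty could be computed in PyTorch. The function name `chebyshev_regularizer`, the use of batch-wise class means as prototypes, the nearest-other-prototype distance, and the weighting hyperparameter `lam` are assumptions made for illustration; the paper's actual prototype estimates and loss weighting may differ.

```python
import torch

def chebyshev_regularizer(z: torch.Tensor, y: torch.Tensor, num_classes: int,
                          eps: float = 1e-8) -> torch.Tensor:
    """Average over batch classes of tr(Sigma_c) / min_{c' != c} ||mu_c - mu_{c'}||^2,
    i.e. a Chebyshev-style ratio of within-class spread to between-prototype distance."""
    # Per-class prototypes (feature means) estimated from the current batch.
    protos = torch.stack([
        z[y == c].mean(dim=0) if (y == c).any() else z.new_zeros(z.size(1))
        for c in range(num_classes)
    ])
    present = [c for c in range(num_classes) if (y == c).sum() > 1]
    penalties = []
    for c in present:
        zc = z[y == c]
        # Trace of the within-class covariance = mean squared deviation from the prototype.
        within = ((zc - protos[c]) ** 2).sum(dim=1).mean()
        # Squared distance to the nearest other prototype observed in the batch.
        others = [c2 for c2 in present if c2 != c]
        if not others:
            continue
        between = torch.stack([((protos[c] - protos[c2]) ** 2).sum() for c2 in others]).min()
        penalties.append(within / (between + eps))
    return torch.stack(penalties).mean() if penalties else z.new_zeros(())

# Example usage (hypothetical): add to the classification loss with weight `lam`.
# loss = F.cross_entropy(logits, y) + lam * chebyshev_regularizer(features, y, num_classes)
```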
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6766