Keywords: compression, generalization, overparameterized models, boosting, information theory, model complexity
TL;DR: We discuss the properties of models that capture all of the information in the features that is relevant for predicting the target, and no more.
Abstract: Successful learning algorithms such as DNNs, kernel methods, and ensemble methods are known to produce models that generalize well despite being drawn from overparameterized model families. This observation has called into question the classical convex relationship between model complexity and generalization error. We instead propose rethinking the notion of model complexity that is relevant for assessing models trained on a given dataset. Borrowing from information theory, we identify the optimal model one can train on a given dataset as the one achieving lossless maximal compression of that dataset. In the noiseless-dataset setting, such a model can be shown to coincide with an average margin maximizer of the training data. Experimental results on gradient boosting confirm our observations and show that the minimal generalization error is attained in expectation by models achieving lossless maximal compression of the training data.
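To make the kind of experiment described in the abstract concrete, the sketch below (not the authors' code) trains a gradient-boosted classifier and, after each boosting stage, reports the training log-loss, used here as an assumed proxy for how much of the training data the model has compressed, alongside the test error. The synthetic dataset, hyperparameters, and the log-loss proxy are all illustrative assumptions.

```python
# Minimal sketch: sweep gradient-boosting capacity (number of stages) and compare
# training-set "compression" (proxied by training log-loss, i.e. the residual code
# length of the labels given the model) against test error. All choices below are
# assumptions for illustration, not the paper's experimental setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss, zero_one_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)

# After each boosting stage, report training log-loss (in nats) and test 0-1 error.
for stage, (p_tr, p_te) in enumerate(
        zip(model.staged_predict_proba(X_tr), model.staged_predict_proba(X_te)), 1):
    train_nats = log_loss(y_tr, p_tr)                      # residual "uncompressed" information
    test_err = zero_one_loss(y_te, np.argmax(p_te, axis=1))  # generalization error estimate
    if stage % 50 == 0:
        print(f"stage {stage:3d}  train log-loss {train_nats:.4f}  test error {test_err:.3f}")
```

Under the abstract's claim, one would expect the test error to be smallest around the capacity at which the training labels are (nearly) losslessly compressed, rather than continuing to degrade monotonically with added capacity.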