Cross Validation for Correlated Data in Classification Models

Published: 22 Jan 2025, Last Modified: 06 Mar 2025, Venue: AISTATS 2025 Poster, Readers: Everyone, License: CC BY 4.0
TL;DR: An adaptation of the standard cross-validation procedure for classification models that yields an unbiased estimator of the generalization error when the data are not collected i.i.d.
Abstract: We present a methodology for model evaluation and selection in binary classification when the data are correlated, that is, when the sampling mechanism violates the i.i.d. assumption. The methodology consists of a formulation of the bias term between the standard Cross-Validation (CV) estimator and the mean generalization error, together with practical data-based procedures to estimate this term. We then present the bias-corrected CV estimator, which is the standard CV estimate supplemented by an estimate of the bias term. This concept has previously appeared in the literature only for a linear model with squared error loss as the criterion for prediction performance. Our bias-corrected CV estimator can be applied to any learning model, including deep neural networks, and to standard criteria for prediction performance in classification tasks, including misclassification rate, cross-entropy, and hinge loss. We demonstrate the applicability of the proposed methodology on synthetic and real-world datasets with complex correlation structures (such as clustered and spatial relationships), providing evidence that the bias-corrected CV estimator is better than the standard CV estimator.
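The relationship described in the abstract can be written compactly as follows. This is only a notational sketch: the symbols $\mathrm{CV}$, $\mathrm{Err}$, $w$, $\widehat{w}$ and $\mathrm{CV}_{c}$ are assumptions introduced here for illustration and may differ from the paper's notation, as may the sign convention chosen for the bias term. The abstract does not state how $\widehat{w}$ is computed, so no form is assumed for it.

```latex
% Assumed notation (not taken from the paper):
%   CV       : the standard cross-validation estimate
%   Err      : the generalization error, averaged over training samples
%   w        : bias term between the CV estimator and the mean generalization error
%   \hat{w}  : a data-based estimate of w
%   CV_c     : the bias-corrected CV estimator
\[
  w \;=\; \mathbb{E}\left[\mathrm{Err}\right] - \mathbb{E}\left[\mathrm{CV}\right],
  \qquad
  \mathrm{CV}_{c} \;=\; \mathrm{CV} + \widehat{w}.
\]
```

Under this convention, $w = 0$ when the standard CV estimator is already unbiased for the mean generalization error, so the correction matters only when the two expectations differ, which the abstract associates with correlated, non-i.i.d. sampling.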
Submission Number: 584