Early-stopped neural networks are consistent

Published: 09 Nov 2021, Last Modified: 05 May 2023
NeurIPS 2021 Spotlight
Readers: Everyone
Keywords: Neural Networks, Deep Networks, calibration, consistency, nonseparable, gradient descent
TL;DR: For general classification problems, including those with label noise, gradient descent with early stopping on shallow ReLU networks achieves population risk arbitrarily close to the optimal risk among all measurable predictors.
Abstract: This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. Moreover, the necessary iteration, sample, and architectural complexities of this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any classifier satisfying a basic local interpolation property is inconsistent.
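To make the calibration claim concrete, the following is an illustrative statement in our own notation (not quoted from the paper): with η(x) the true conditional probability of a positive label, σ the sigmoid, and f_t the early-stopped network, the sigmoid of the network output tracks η in expectation.

```latex
% Illustrative formalization (notation ours, not the paper's exact theorem):
% eta is the true conditional model, sigma the sigmoid, f_t the early-stopped
% network; epsilon > 0 is arbitrary.
\[
  \eta(x) := \Pr[\, y = 1 \mid x \,],
  \qquad
  \sigma(z) := \frac{1}{1 + e^{-z}},
\]
\[
  \mathbb{E}_x \bigl|\, \sigma(f_t(x)) - \eta(x) \,\bigr| \le \varepsilon,
\]
% with the iteration count, sample size, and network width needed for a given
% epsilon scaling with a complexity measure of eta, as the abstract states.
```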
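As a concrete illustration of the training procedure the abstract describes, here is a minimal sketch, not the authors' code: full-batch gradient descent on a one-hidden-layer ReLU network under the logistic loss, early-stopped on held-out loss, then checked for calibration against a known conditional model. The synthetic eta, the width, the step size, and the patience-based stopping rule are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's setup): a shallow ReLU network
# trained by full-batch gradient descent on the logistic loss, early-stopped
# on held-out loss, then checked for calibration against a known eta(x).
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Hypothetical true conditional probability Pr[y = +1 | x]; since eta is
    # not 0/1, the Bayes risk is nonzero, matching the paper's setting.
    return 1.0 / (1.0 + np.exp(-3.0 * np.sin(x)))

def sample(n):
    x = rng.uniform(-3.0, 3.0, size=(n, 1))
    y = np.where(rng.uniform(size=n) < eta(x[:, 0]), 1.0, -1.0)
    return x, y

def forward(x, W, b, a):
    h = np.maximum(x @ W.T + b, 0.0)   # hidden ReLU layer
    return h @ a                       # scalar network output f(x)

def logistic_loss(f, y):
    return np.mean(np.logaddexp(0.0, -y * f))

def grads(x, y, W, b, a):
    n = x.shape[0]
    pre = x @ W.T + b
    h = np.maximum(pre, 0.0)
    f = h @ a
    g = -y / (1.0 + np.exp(y * f)) / n      # d(loss)/d(f), averaged over n
    ga = h.T @ g                            # gradient for outer weights
    gh = np.outer(g, a) * (pre > 0)         # backprop through the ReLU
    return gh.T @ x, gh.sum(axis=0), ga     # gradients for W, b, a

# Data size, width, and initialization scales are arbitrary choices.
x_tr, y_tr = sample(2000)
x_va, y_va = sample(2000)
m = 256                                     # hidden width
W = rng.normal(scale=1.0, size=(m, 1))
b = rng.normal(scale=1.0, size=m)
a = rng.normal(scale=1.0 / np.sqrt(m), size=m)

best = (np.inf, W.copy(), b.copy(), a.copy())
lr, patience, bad = 0.1, 200, 0
for t in range(20000):
    gW, gb, ga = grads(x_tr, y_tr, W, b, a)
    W -= lr * gW; b -= lr * gb; a -= lr * ga
    va = logistic_loss(forward(x_va, W, b, a), y_va)
    if va < best[0] - 1e-5:
        best, bad = (va, W.copy(), b.copy(), a.copy()), 0
    else:
        bad += 1
        if bad >= patience:                 # early stopping on held-out loss
            break

_, W, b, a = best
xs = np.linspace(-3, 3, 512).reshape(-1, 1)
p_hat = 1.0 / (1.0 + np.exp(-forward(xs, W, b, a)))  # sigmoid of outputs
cal_err = np.mean(np.abs(p_hat - eta(xs[:, 0])))
print(f"mean |sigmoid(f(x)) - eta(x)| = {cal_err:.3f}")
```

Because the labels are noisy, running far past the early-stopping point lets the network shrink training loss by fitting noise; this is the regime the abstract's final remark addresses, where classifiers with a basic local interpolation property are shown to be inconsistent.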