Rethinking Benign Overfitting in Two-Layer Neural Networks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We re-examine benign overfitting in two-layer neural networks and prove that while data can be classified using explicit features, long-tailed data can also be classified based on implicit features learned from class-dependent noise.
Abstract: Recent theoretical studies (Kou et al., 2023; Cao et al., 2022) revealed a sharp phase transition from benign to harmful overfitting when the noise-to-feature ratio exceeds a threshold—a situation common in long-tailed data distributions where atypical data is prevalent. However, such harmful overfitting rarely happens in overparameterized neural networks. Further experimental results suggested that memorization is necessary for achieving near-optimal generalization error in long-tailed data distributions (Feldman & Zhang, 2020). We argue that this discrepancy between theoretical predictions and empirical observations arises because previous feature-noise data models overlook the heterogeneous nature of noise across different data classes. In this paper, we refine the feature-noise data model by incorporating class-dependent heterogeneous noise and re-examine the overfitting phenomenon in neural networks. Through a comprehensive analysis of the training dynamics, we establish test loss bounds for the refined model. Our findings reveal that neural networks can leverage "data noise" to learn implicit features that improve the classification accuracy for long-tailed data. Our analysis also provides a training-free metric for evaluating data influence on test performance. Experimental validation on both synthetic and real-world datasets supports our theoretical results.
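For intuition, here is a minimal NumPy sketch of a class-dependent feature-noise data model in the spirit of the abstract. The two-patch structure (a signal patch y·μ plus a noise patch) follows the feature-noise models of Cao et al. (2022) and Kou et al. (2023); the specific refinement shown here, where the tail class's noise patch carries a consistent direction ν that a network could learn as an implicit feature, along with all dimensions and scales, is an illustrative assumption rather than the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 500      # ambient dimension per patch (illustrative)
n = 200      # number of training samples (illustrative)
sigma = 1.0  # background noise scale (assumed)

# Explicit feature: both classes carry the signed signal y * mu in patch 0.
mu = np.zeros(d)
mu[0] = 2.0

# Hypothetical implicit feature: a direction orthogonal to mu, embedded
# only in the noise patch of the rare "tail" class (y = -1).
nu = np.zeros(d)
nu[1] = 1.0

def sample(n_samples):
    """Draw (x, y) pairs with two patches per example.

    Patch 0 carries the explicit signal y * mu; patch 1 carries noise.
    Unlike homogeneous feature-noise models, the noise is class-dependent:
    tail-class noise contains the direction nu, which a network can pick
    up as an implicit feature that helps classify long-tailed data.
    """
    y = rng.choice([1, -1], size=n_samples)
    signal = y[:, None] * mu[None, :]
    noise = sigma * rng.standard_normal((n_samples, d))
    noise += (y == -1)[:, None] * nu[None, :]  # class-dependent component
    return np.stack([signal, noise], axis=1), y  # shape (n, 2, d)

X_train, y_train = sample(n)
print(X_train.shape, y_train.shape)  # (200, 2, 500) (200,)
```

Under this kind of construction, a classifier that memorized the tail class's "noise" would in fact be exploiting the direction ν, which is consistent with the paper's claim that such memorization can improve, rather than harm, generalization on long-tailed data.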
Lay Summary: Scientists were puzzled by a contradiction in AI research. Theories predicted that when powerful AI models are trained on very noisy datasets, they should learn the random noise in the data and perform poorly—a problem called "harmful overfitting." In practice, however, this rarely happens. The researchers in this study argue that previous theories were missing a key detail: they assumed the "noise" in the data is the same for all categories. This paper suggests that the noise differs depending on the data class (e.g., the visual "noise" in pictures of rare birds differs from that in pictures of common dogs). By creating a more realistic model that includes this varied noise, they discovered that the AI doesn't just ignore the noise; it actually leverages it. The network learns hidden features from what appears to be random data noise, which in turn helps it correctly identify rare items. As a practical result, the team also developed a new metric that can evaluate how much a piece of data will influence the AI's performance without running the entire training process. Their findings were confirmed on both computer-generated and real-world data.
Primary Area: Deep Learning->Theory
Keywords: Benign overfitting, long-tailed data, two-layer neural networks
Submission Number: 16273