Keywords: data quality; probabilistic model; multi-perspective analysis
Abstract: A systematic, comparative investigation into the effects of low-quality training data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient: for GPT-2, test NLL increases only modestly, from 2.87 to 3.59, despite 50\% token corruption. By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81\% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the \textbf{richness of conditioning information}, which constrains the learning problem, and the \textbf{absolute information content} of the training data, which allows the signal from correct supervision to dominate statistical noise.
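The 50\% token-corruption setting mentioned above can be illustrated with a minimal sketch. Note that `corrupt_tokens` and its uniform-replacement scheme are assumptions for illustration; the paper's exact corruption procedure is not specified in this abstract.

```python
import random

def corrupt_tokens(tokens, vocab_size, corruption_rate=0.5, seed=0):
    """Replace a fraction of token ids with uniformly random ids.

    A toy stand-in for the token-corruption setup described in the
    abstract; the study's actual scheme may differ.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    n_corrupt = int(len(tokens) * corruption_rate)
    # Pick distinct positions to corrupt, then overwrite each with a
    # uniformly sampled token id (which may, rarely, equal the original).
    for i in rng.sample(range(len(tokens)), n_corrupt):
        corrupted[i] = rng.randrange(vocab_size)
    return corrupted

clean = list(range(20))
noisy = corrupt_tokens(clean, vocab_size=50257, corruption_rate=0.5)
```

At a 0.5 corruption rate, at most half of the positions differ from the clean sequence, which is the regime in which GPT-2's test NLL reportedly rises only from 2.87 to 3.59.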
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 24590