Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: data quality, probabilistic model, multi-perspective analysis
Abstract: A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from 2.87 to 3.59 despite 50% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the richness of conditioning information, which constrains the learning problem, and the absolute information content of the training data, which allows the signal from correct information to dominate statistical noise.
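The abstract's corruption protocol (replacing a fraction of training tokens with noise before measuring test NLL) can be sketched as follows. This is a hypothetical illustration, not the paper's released code: the function name `corrupt_tokens`, the uniform-replacement scheme, and the GPT-2 vocabulary size of 50,257 are assumptions about how such an experiment is typically run.

```python
import random

def corrupt_tokens(tokens, vocab_size, p=0.5, seed=0):
    """Replace each token id with a uniformly random id with probability p.

    Hypothetical sketch of the 50% token-corruption setting described in
    the abstract; the paper's exact noise model may differ.
    """
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) if rng.random() < p else t
            for t in tokens]

# Example: corrupt a short sequence of GPT-2-style token ids at p = 0.5.
clean = [464, 2068, 7586, 21831, 11687]
noisy = corrupt_tokens(clean, vocab_size=50257, p=0.5)
assert len(noisy) == len(clean)  # corruption preserves sequence length
```

A model trained on the corrupted stream would then be evaluated on clean held-out text, so the reported NLL gap (2.87 → 3.59) isolates the effect of training-data noise rather than test-time noise.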
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 24590