Data Contradictions Are Uncertainty, Not Noise

Adhiraj Chhoda

Published: 03 Jun 2026, Last Modified: 11 Jun 2026. Venue: AI4GOOD Workshop 2026 Regular. License: CC BY 4.0

Keywords: data quality, data contradictions, uncertainty quantification, consistent query answering, repair sensitivity, responsible AI, ML pipelines

TL;DR: Data contradictions in ML training sets should be treated as uncertainty to quantify, not noise to clean away.

Abstract: This position paper argues that data contradictions in ML training sets should be treated as uncertainty to be quantified, not noise to be cleaned away. The standard ML pipeline treats data quality as preprocessing: find inconsistencies, pick a repair, train on the result. This workflow silently discards information. When two hospital records disagree on a patient’s diagnosis, that disagreement reflects genuine ambiguity, and a model trained on one arbitrary resolution is overconfident in exactly the cases where it should be uncertain. We show that the multiplicity of valid data repairs maps naturally to prediction uncertainty: train on each repair, measure prediction disagreement, and the result is an informative confidence signal that requires no architectural changes and no Bayesian machinery. As an illustrative proof of concept on Adult Income, models trained on different repair strategies disagree with clean-baseline predictions on 2.3% of test instances, concentrated in the subgroups affected by the original contradiction. Systematic validation across datasets is future work. Of 101 data-cleaning-for-ML papers surveyed by Côté et al. (2024), zero treat repair non-uniqueness as an uncertainty signal. Of 51 tabular-ML papers at NeurIPS and ICML 2024–2025, not one engages with the 26-year database theory literature on consistent query answering. The field cleans when it should reason.
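The procedure the abstract describes (train on each repair, then measure prediction disagreement) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's Adult Income experiment: the toy rows, the three repair strategies (`repair_keep_first`, `repair_keep_last`, `repair_drop`), and the lookup-table stand-in for a model are all hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical toy training rows as (features, label).
# The first two rows share features but disagree on the label.
ROWS = [
    (("clinic_a", "fever"), "flu"),
    (("clinic_a", "fever"), "cold"),
    (("clinic_b", "cough"), "cold"),
    (("clinic_b", "cough"), "cold"),
    (("clinic_c", "fever"), "flu"),
]

def conflicting_keys(rows):
    """Feature tuples that appear with more than one distinct label."""
    labels = {}
    for x, y in rows:
        labels.setdefault(x, set()).add(y)
    return {x for x, ys in labels.items() if len(ys) > 1}

def repair_keep_first(rows):
    """For each conflicting key, keep only rows carrying the first label seen."""
    bad = conflicting_keys(rows)
    first = {}
    for x, y in rows:
        first.setdefault(x, y)
    return [(x, y) for x, y in rows if x not in bad or y == first[x]]

def repair_keep_last(rows):
    """For each conflicting key, keep only rows carrying the last label seen."""
    bad = conflicting_keys(rows)
    last = {x: y for x, y in rows}
    return [(x, y) for x, y in rows if x not in bad or y == last[x]]

def repair_drop(rows):
    """Drop every row whose key is involved in a contradiction."""
    bad = conflicting_keys(rows)
    return [(x, y) for x, y in rows if x not in bad]

def train(rows):
    """Stand-in model: per-key majority label, with a global-majority fallback."""
    fallback = Counter(y for _, y in rows).most_common(1)[0][0]
    by_key = {}
    for x, y in rows:
        by_key.setdefault(x, Counter())[y] += 1
    table = {x: c.most_common(1)[0][0] for x, c in by_key.items()}
    return lambda x: table.get(x, fallback)

def disagreement(models, x):
    """Share of models that deviate from the most common prediction for x."""
    votes = Counter(m(x) for m in models)
    return 1 - votes.most_common(1)[0][1] / len(models)

# One model per repair of the same contradictory data.
REPAIRS = [repair_keep_first, repair_keep_last, repair_drop]
MODELS = [train(repair(ROWS)) for repair in REPAIRS]

for key in [("clinic_a", "fever"), ("clinic_b", "cough")]:
    print(key, round(disagreement(MODELS, key), 3))
```

In this toy setup the disagreement score is nonzero only for the key touched by the contradiction, which mirrors the abstract's claim that disagreement concentrates where the original data conflicted. Any real classifier could replace `train`; the lookup table is used only to keep the sketch dependency-free.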

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 476
