Keywords: data difficulty, instance difficulty, model debugging, noisy label detection, data pruning
TL;DR: An empirical study of data difficulty for tabular deep learning and a resulting three-factor model: label-aware difficulty, confidence, and influence/valuation.
Abstract: The notion of data difficulty has garnered attention in the machine learning community due to its wide-ranging applications, from noisy label detection to data debugging and pruning.
Yet with many competing definitions, researchers and practitioners have often selected difficulty metrics in an ad hoc manner.
Further, systematic evaluations have been limited to vision settings, and tabular deep learning presents its own challenges.
To aid principled metric selection in tabular deep learning, we conduct a comprehensive empirical study of existing metrics, including logit-based, gradient-based, ensemble, valuation, and influence methods.
By collecting difficulty scores across diverse model architectures, tasks, and epochs, we assemble a large-scale dataset for statistical analysis.
We ask and answer the following questions:
(1) How many orthogonal factors comprise data difficulty?
(2) How many metrics and random seeds are needed to rank difficulty robustly?
(3) Which metrics are most effective for noisy label detection?
(4) Is the factor structure stable across subgroup splits?
(5) How are early-training and late-training difficulty different?
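A minimal sketch of how question (1) can be approached, assuming difficulty scores are collected into an examples-by-runs matrix and the factor count is read off a PCA variance curve (the data, dimensions, and 90% threshold below are illustrative placeholders, not the study's protocol):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Placeholder for real scores: rows are examples, columns are (metric, model, epoch, seed) runs.
    scores = rng.normal(size=(10_000, 40))

    # Rank-transform each column so metrics on different scales become comparable.
    ranked = scores.argsort(axis=0).argsort(axis=0) / (scores.shape[0] - 1)

    pca = PCA().fit(ranked)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_factors = int(np.searchsorted(cumulative, 0.90)) + 1  # components covering 90% of variance
    print(f"components needed for 90% of variance: {n_factors}")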
Our results contradict both the view that difficulty metrics are largely redundant and the view that each metric is hyper-specialized.
Instead, we identify three consistent factors: label-aware difficulty, confidence, and influence/valuation.
We show that measuring a computationally inexpensive exemplar of each factor captures most interpretable information, and that the three factors are strongly predictive of noisy labels.
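As a hedged illustration of such exemplars (the particular choices here are common stand-ins, not necessarily the metrics used in the paper), one inexpensive score per factor can be computed from a single forward pass of a trained classifier:

    import torch
    import torch.nn.functional as F

    def exemplar_scores(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor):
        """Return one cheap per-example score per factor for a classification batch."""
        model.eval()
        with torch.no_grad():
            logits = model(x)
            probs = logits.softmax(dim=-1)

            # Factor 1: label-aware difficulty -- cross-entropy with the ground-truth label.
            label_difficulty = F.cross_entropy(logits, y, reduction="none")

            # Factor 2: confidence -- maximum softmax probability, label-free.
            confidence = probs.max(dim=-1).values

            # Factor 3: influence/valuation stand-in -- norm of the loss gradient
            # w.r.t. the logits (probs - one_hot), a crude but cheap proxy.
            grad_norm = (probs - F.one_hot(y, probs.shape[-1]).float()).norm(dim=-1)

        return label_difficulty, confidence, grad_norm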
We further observe that confidence is more prominent in test data, whereas influence/valuation is more important in train data.
Rank stability analysis shows that combining just two metrics, each measured over two random seeds, yields rankings that correlate strongly with the ground truth.
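A toy simulation of this rank-stability setup (synthetic scores standing in for real metric measurements) shows the mechanics: rank each of the four metric/seed combinations, average the ranks, and compare against a reference ranking with Spearman correlation:

    import numpy as np
    from scipy.stats import rankdata, spearmanr

    rng = np.random.default_rng(0)
    true_difficulty = rng.normal(size=5_000)  # stand-in "ground truth" difficulty

    def noisy_view():
        # One metric measured under one seed: a noisy view of the underlying difficulty.
        return true_difficulty + rng.normal(scale=0.5, size=5_000)

    # Two metrics x two seeds: four noisy views, combined by rank-averaging.
    views = [noisy_view() for _ in range(4)]
    combined = np.mean([rankdata(v) for v in views], axis=0)

    rho, _ = spearmanr(combined, rankdata(true_difficulty))
    print(f"Spearman correlation with reference ranking: {rho:.2f}")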
Finally, we contribute an open-source Python library that streamlines the measurement of difficulty metrics from model snapshots.
Primary Area: interpretability and explainable AI
Submission Number: 9703