- Keywords: Dataset understanding, Difficulty Measure, Information Theory, Fano's Inequality, Conditional Entropy
- TL;DR: We design DIME, an empirical difficulty measure that characterizes the intrinsic complexity of a dataset's sample-label distribution in supervised learning.
- Abstract: Evaluating the relative difficulty of widely used benchmark datasets, across time and across data modalities, is important for accurately measuring progress in machine learning. To help tackle this problem, we propose DIME, an information-theoretic DIfficulty MEasure for datasets, based on Fano’s inequality and a neural-network estimate of the conditional entropy of the sample-label distribution. DIME decomposes into components attributable to the data distribution and to the number of samples, and it also yields per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME aligns well with the empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.
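As context for the abstract, the standard form of Fano's inequality (a textbook result, not wording taken from this paper) links the conditional entropy $H(Y \mid X)$ of labels $Y \in \mathcal{Y}$ given samples $X$ to the error probability $P_e$ of any predictor of $Y$ from $X$:

```latex
% Fano's inequality (logs base 2; H_b is the binary entropy function):
\[
  H(Y \mid X) \;\le\; H_b(P_e) \;+\; P_e \log_2\bigl(|\mathcal{Y}| - 1\bigr).
\]
% Since H_b(P_e) <= 1 bit, rearranging gives a lower bound on the error:
\[
  P_e \;\ge\; \frac{H(Y \mid X) - 1}{\log_2\bigl(|\mathcal{Y}| - 1\bigr)}.
\]
```

Higher conditional entropy therefore forces a higher minimum achievable error for every model, which is the sense in which an estimate of $H(Y \mid X)$ can serve as a dataset difficulty measure; the specific neural-network estimator DIME uses is not detailed in the abstract.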