DIME: An Information-Theoretic Difficulty Measure for AI Datasets

19 Oct 2020 (modified: 21 Nov 2020) · NeurIPS 2020 Workshop DL-IG Blind Submission
  • Keywords: Dataset understanding, Difficulty Measure, Information Theory, Fano's Inequality, Conditional Entropy
  • TL;DR: We design DIME, an empirical difficulty measure for datasets to characterize the intrinsic complexity of the sample-label distribution in supervised learning.
  • Abstract: Evaluating the relative difficulty of widely-used benchmark datasets across time and across data modalities is important for accurately measuring progress in machine learning. To help tackle this problem, we propose DIME, an information-theoretic DIfficulty MEasure for datasets, based on Fano’s inequality and a neural network estimation of the conditional entropy of the sample-label distribution. DIME can be decomposed into components attributable to the data distribution and the number of samples. DIME can also compute per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME is well aligned with empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.
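The abstract describes using Fano's inequality together with an estimate of the conditional entropy H(Y|X) of the sample-label distribution. The paper's exact estimator and score definition are not given here, so the following is only a minimal sketch of the Fano step: given a conditional-entropy estimate (in nats, e.g. the cross-entropy loss of a trained classifier, which upper-bounds H(Y|X)), it numerically inverts Fano's inequality, H(Y|X) ≤ h(Pe) + Pe·log(K−1), to obtain a lower bound on the achievable error Pe for a K-class problem. The function name and the bisection scheme are illustrative assumptions, not DIME's actual implementation.

```python
import math

def fano_error_lower_bound(cond_entropy_nats, num_classes):
    """Lower-bound the best achievable classification error via Fano's
    inequality: H(Y|X) <= h(Pe) + Pe * log(K - 1).

    Illustrative sketch only; not the paper's implementation.
    cond_entropy_nats: an estimate of H(Y|X) in nats.
    num_classes: K, the number of labels.
    """
    K = num_classes

    def binary_entropy(p):  # h(p) in nats
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

    def fano_rhs(p):  # right-hand side of Fano's inequality
        return binary_entropy(p) + p * math.log(K - 1) if K > 2 else binary_entropy(p)

    # fano_rhs is increasing on [0, (K-1)/K] and peaks at log K,
    # so clip the target and find the crossing point by bisection.
    target = min(cond_entropy_nats, math.log(K) - 1e-12)
    lo, hi = 0.0, (K - 1) / K
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if fano_rhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, at the maximum conditional entropy log K (labels carry no information given the input), the bound recovers the chance-level error (K−1)/K, while a conditional entropy of zero yields a bound of zero, matching the intuition that DIME-style scores should rank datasets with noisier sample-label distributions as harder.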