DIME: An Information-Theoretic Difficulty Measure for AI Datasets

Published: 07 Nov 2020, Last Modified: 05 May 2023
NeurIPSW 2020: DL-IG Poster
Keywords: Dataset understanding, Difficulty Measure, Information Theory, Fano's Inequality, Conditional Entropy
TL;DR: We design DIME, an empirical difficulty measure for datasets to characterize the intrinsic complexity of the sample-label distribution in supervised learning.
Abstract: Evaluating the relative difficulty of widely-used benchmark datasets, across time and across data modalities, is important for accurately measuring progress in machine learning. To help tackle this problem, we propose DIME, an information-theoretic DIfficulty MEasure for datasets, based on Fano's inequality and a neural-network estimate of the conditional entropy of the sample-label distribution. DIME decomposes into components attributable to the data distribution and to the number of samples, and it can also produce per-class difficulty scores. Through extensive experiments on both vision and language datasets, we show that DIME aligns well with the empirically observed performance of state-of-the-art machine learning models. We hope that DIME can aid future dataset design and model-training strategies.
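The abstract's core idea can be illustrated with a small sketch (not the authors' implementation): Fano's inequality, H(Y|X) <= h(Pe) + Pe*log2(M-1) for an M-class problem, bounds the best achievable error Pe from below given an estimate of the conditional entropy H(Y|X) (which in practice might come from a trained model's cross-entropy loss). Since the right-hand side is monotone in Pe on [0, (M-1)/M], the bound can be inverted by bisection:

```python
import math

def fano_error_lower_bound(cond_entropy_bits, num_classes, tol=1e-9):
    """Invert Fano's inequality H(Y|X) <= h(Pe) + Pe * log2(M - 1)
    to obtain a lower bound on the Bayes error Pe for an M-class problem.

    cond_entropy_bits: an estimate of H(Y|X) in bits (e.g. from a model's
    test cross-entropy, which upper-bounds the true conditional entropy).
    """
    M = num_classes

    def h(p):
        # Binary entropy in bits; 0 at the endpoints by convention.
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

    def rhs(p):
        # Right-hand side of Fano's inequality; log2(M-1) is 0 when M == 2.
        return h(p) + p * math.log2(M - 1)

    if cond_entropy_bits <= 0.0:
        return 0.0
    # rhs increases from 0 to log2(M) as p goes from 0 to (M-1)/M.
    lo, hi = 0.0, (M - 1) / M
    if cond_entropy_bits >= rhs(hi):
        return hi
    # Bisection: find the smallest Pe whose rhs reaches the entropy estimate.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rhs(mid) < cond_entropy_bits:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, a binary task with one full bit of conditional entropy (labels unpredictable from the input) yields a lower bound of 0.5, while zero conditional entropy yields a bound of 0. The function name and interface here are illustrative; the paper's DIME additionally decomposes the estimate into distribution- and sample-size-dependent components, which this sketch does not cover.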